METHOD FOR DESIGNING DNA-BINDING PROTEINS OF THE ZINC-FINGER CLASS

BACKGROUND OF THE INVENTION 

A superfamily of eukaryotic genes encoding potential nucleic-acid-binding proteins
contains zinc-finger (ZF) domains of the Cys2-His2 (C2H2) class. Proteins that have these
characteristic structural features play a key role in the regulation of gene expression [1-4].
Sequence comparisons, mutational analyses, and a recent crystallographic investigation have
revealed that each finger domain, as a rule, interacts with the major groove of B-form DNA
through contacts with some or all three base pairs within a DNA triplet. These base-specific
interactions are mediated through amino acid (AA) side chains at specific positions in the
α-helical region [5-10] of the protein domain.

Although the AA sequences of more than 1,300 ZF motifs have been identified, the
exact DNA-binding sites are known only for a few proteins. The available information on
DNA contact regions concerns mainly guanine-cytosine-rich strands [5-9] and fewer
adenine-thymine-rich sites [11,12]. On the basis of experimental data, the first proposals for rules
relating ZF sequences to preferred DNA-binding sites have been made [13,14]. However, no
general rules for ZF protein-DNA recognition have been proposed. This is likely due to the
fact that neither computer modeling [2,3,5] nor crystallographic analysis [7] has provided
enough information on the overall structural variety in the ZF-DNA contact region.

An objective of the work leading to the present invention was to determine a set of
general rules for ZF-DNA recognition for the C2H2 class of ZF domains, using physical
atomic-molecular models to characterize the steric conditions in the specific contact
positions for different ZF-DNA interactions. Once this objective had been reached, the
further aim was to develop an algorithm, and a computer system using the algorithm, to
design effective zinc-finger DNA-binding polypeptides. The achievement of these goals
represents a major advance of knowledge in the field, knowledge characterized by the



WO 99/42474 PCT/US99/03692

disclosures of Rebar et al. and Beerli et al. [15,16]. These two disclosures are concerned
with the selection, using the phage display system, of specific zinc fingers with new DNA- 
binding specificities. On the other hand, the present disclosure is concerned with the design of 
DNA-binding proteins for any given DNA sequence. 

SUMMARY OF THE INVENTION 

The invention is directed to the design and specification of DNA-binding proteins
binding via C2H2 zinc-finger motifs (DBP's or, individually, a DBP). On the basis of the
studies described herein, general rules for optimizing such binding have been determined, and
a formula describing the class of DBP's having optimal DNA-binding properties has been
constructed. Furthermore, a program has been developed, based on the rules, which affords
the design of DBP's with such high binding affinity for any given DNA sequence. Lastly, rules
have been determined for the design of DBP's which, while not having optimal binding, do
have significant and useful DNA-binding properties.

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 depicts the alignment of ZF domains in various known DBP's.

Figure 2 is a schematic representation of the interaction between a target DNA triplet
and a single ZF domain.

Figure 3 is a schematic representation of the interaction between a target DNA string
of 9 bases and a three-domain DBP.

Figure 4 is a block flow diagram of the computer system by which the instant DBP
design process is implemented.

Figure 5 is a block flow diagram wherein the Computer Program block (2) of Figure 4
is further broken down.

Figure 6 is a block flow diagram wherein the Process Genome into Blocking Fragment
Files block (2) of Figure 5 is further broken down.

Figure 7 is a block flow diagram wherein the Design DBP's for a Genome block (3) of
Figure 5 is further broken down.

Figure 8 is a block flow diagram wherein block (22) of Figure 7 is further broken
down.

Figure 9 is a block flow diagram wherein block (24) of Figure 7 is further broken
down.

Figure 10 shows the distribution of binding strengths of acceptable 9-finger DBP's
across the yeast genes analyzed.

Figure 11 shows the values of the binding energies of the acceptable 9-finger DBP's
found for the yeast genes analyzed.

Figure 12 shows the distribution of DBP subsite (spurious) binding energies across the
yeast genes analyzed.

Figure 13 shows, in nonlogarithmic fashion, the distribution depicted in Figure 12.

Figure 14 shows the ratios of binding energy to subsite (spurious) binding energy,
across the yeast genes analyzed, for the acceptable 9-finger DBP's.

Figure 15 shows the values of the spurious binding energies for each of the
27-base-pair (bp) frames of the 300-bp promoter region of yeast gene YAR073.

Figure 16 shows the ratios of binding energy to subsite (spurious) binding energy for
each of the 27-base-pair (bp) frames of the 300-bp promoter region of yeast gene YAR073.

Figure 17 shows the distribution of sizes of acceptable DBP's across the C. elegans
genes analyzed.

Figure 18 shows the ratios of binding energy to subsite (spurious) binding energy,
across the C. elegans genes analyzed, for the acceptable DBP's.

DETAILED DESCRIPTION OF THE INVENTION

The general rules governing the binding of C2H2 ZF motifs to DNA were developed
by using a combination of the database analysis of the homologies between 1,851 possible ZF
domains and physical molecular modeling of the interaction of a DBP model with a DNA
model containing all 64 possible base-pair triplets. The DBP model approximates the size and
shape of a half-gallon jug of milk. The DNA model is approximately four feet long and one
foot in diameter. The axis of the DNA model is horizontal and can be rotated to observe each
of the 64 base-pair triplets. By moving the DBP model in and out with respect to the DNA
model one can observe the amino acid and nucleic acid contacts.

Although the following description details the scientific precedents of this invention,
the completeness of the rule set governing the DBP-DNA interaction could have only been
obtained by the continual, derivative interplay of database analysis and physical modeling
during the invention period. Observations as to the conservation and variability of amino acids
at various places in the ZF motif were embodied, first, by constructing a physical model of the
ZF motif and, then, by physically modeling the interaction of a specific DBP with a designated
DNA bp triplet. The physical modeling indicated patterns of amino acid and nucleic acid
interaction which led to further analysis of the database. Iterations of this interplay between
database analysis and physical modeling enabled conceptual refinement and expansion of the
nature of contact patterns. As these patterns emerged, systematic variation of the amino acids
in the ZF motif was undertaken for each of the 64 base-pair triplets. The physical modeling of
the interaction between a DBP and DNA was efficient because alternative amino acids could
be easily introduced into the ZF motif and the resulting protein physically modeled against the
DNA. Hydrogen bonding, and water and hydrophobic contacts could then be modeled, clearly
determined and counted very quickly. From this physical modeling a general set of rules was
developed which incorporates criteria for the design of DBP's that specifically interact with
DNA.

The utility of ZF sequence analysis and alignment is illustrated by Figure 1. The
TFIIIA protein is widely used as a model for ZF proteins both in terms of physical
measurement and modification and theoretical data analysis. For each of the nine zinc-finger
domains the TFIIIA amino acid sequence in this figure has been aligned so that the
zinc-binding amino acids, the two cysteines (CYS) and the two histidines (HIS), are aligned in four
columns. In order to achieve this alignment dashes must be inserted into the sequence at
various places to provide for domains which have additional amino acids. The same type of
alignment has been done for ZF protein MKR2 and the Kruppel proteins. The MKR2
sequence alignment is very compact; there is no need for any insertions, since all of its ZF
domains are of the same size. Compared to TFIIIA, MKR2 acts as a much more uniform
model for studying the interaction of the amino acids of the protein with the bases of the
specific double-stranded DNA. To arrive at the present invention, MKR2 has been used
exclusively as the sequence basis for deducing the general rules which govern DBP-DNA
interactions.

The crystallographic analysis of a complex containing three ZFs from ZF protein
Zif268 and a consensus DNA-binding site helped identify the localization of ZF-B-DNA
recognition sub-sites [7]. Because the mutagenesis and sequence investigation results are in
accordance with crystal structure data, it is reasonable to expect that the same contact regions
also participate in the interaction of other ZF-DNA complexes [5,6,8-10]. Thus, it has been
assumed that the following ZF components of a ZF protein play a key role in the anti-parallel
DNA reading process: 1) the AA immediately preceding the α-helical region of the protein; 2)
the third residue within the α-helical region, i.e., that immediately preceding constant leucine;
and, 3) the sixth residue of this region, i.e., that immediately preceding invariant histidine.

These components are indicated below as Z3, Z2 and Z1, respectively, in the
generalized ZF sequence (α-helical and β-structural regions are underlined) given in Formula I:

Y/F X C X2-4 C G/D K/R X F X Z3 X X Z2 L X Z1 H X3-5 H T/S G/E X0-2 E K/R P
β-structural region α-helix

Formula I

wherein X is any amino acid; X2-4 is a peptide 2 to 4 amino acids in length; X3-5 is a peptide 3
to 5 amino acids in length; X0-2 is a peptide 0 to 2 amino acids in length; and C, D, E, F, G, H,
K, L, P, R, S, T and Y designate specific amino acids according to the standard single-letter
code. Pairs of letters separated by "/" indicate that the position can be filled by either of the
two specific amino acids designated.

Keeping in mind the above formula, one can envision the formation of antiparallel,
trinucleotide-peptide complexes with three (first, second and third) contact positions as
follows:

5'-N1-N2-N3-3'

COOH-Z1-Z2-Z3-NH2
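
The antiparallel pairing shown in this scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name contact_pairs and the tuple layout are hypothetical and are not part of the disclosed computer system.

```python
def contact_pairs(triplet: str) -> list[tuple[str, str]]:
    """Pair each base of a 5'->3' DNA triplet with its ZF contact position.

    The pairs are returned in protein N->C order (Z3, Z2, Z1), which runs
    antiparallel to the 5'->3' direction of the primary DNA strand.
    """
    bases = triplet.upper()
    if len(bases) != 3 or any(b not in "ACGT" for b in bases):
        raise ValueError("expected a triplet made of A, C, G or T")
    n1, n2, n3 = bases
    # Z1 contacts the first nucleotide, Z2 the second, Z3 the third.
    return [("Z3", n3), ("Z2", n2), ("Z1", n1)]


print(contact_pairs("GAT"))
```

Reading the returned list from left to right follows the polypeptide from its NH2 end toward its COOH end, while the paired bases run from the 3' end of the triplet toward its 5' end.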

The crystallographic investigation of the Zif268-DNA complex also gave indications of
the way the contact groups interact. Pavletich and Pabo [7] concluded that Zif268 forms 11
critical hydrogen bonds (H-bonds) with the bases of the coding DNA strand in the major
groove. Two arginine residues in the first contact positions (see the designations of positions
above) make H-bonds with the N7 and O6 atoms of the guanine. Three arginine residues
hydrogen bond in the same way with guanine in the third contact position. In addition, each
arginine residue in this position forms lateral H-bond, salt bridge interactions with carboxylate
groups of aspartic acid occurring as the second residue in the α-helix. The Nε atom of the
histidine residue in the middle contact position of the second ZF of Zif268 donates an H-bond
to the N7 or O6 atom of guanine. The role of arginine and histidine residues in the interaction
with guanine in ZF polypeptide-DNA complexes is confirmed by directed mutagenesis
experiments [5,6,9,14]. The crystallographic investigations of DNA-binding domains of
lambda and phage 434 repressors, complexed with corresponding operator sites, revealed that
guanine can also be H-bonded by lysine, asparagine, glutamine and serine residues [17,18].
No doubt, the remaining polar AA's - threonine and tyrosine - are able to form analogous
bonds with guanine.

In fingers 1 and 3 of the Zif268-DNA complex, the second (middle) critical position is
occupied by a glutamic acid that does not contact the cytosine at the corresponding region in
the DNA [7]. However, ZF protein-DNA binding assays have shown that in natural binding
sites this interaction does occur with both glutamic acid and aspartic acid [5,6,9,14,19].
Desjarlais and Berg [14] proposed an H-bonding formula for the interaction between cytosine
and aspartic acid. The authors emphasized that the preference for aspartic or glutamic acid in
the interaction with cytosine depends on the presence of glutamine or arginine in the third
contact position (Z3), and serine or aspartic acid in the second position (Z2). The mutagenesis
experiments of Nardelli et al. [5] reveal that cytosine can interact with a glutamine residue.
This may also be true for asparagine, which has similar polar groups. Cytosine should also be
capable of making an H-bond with the hydroxyl oxygen atom in serine and threonine residues.

Thymine in the Zif268-DNA complex does not seem to participate in the recognition
process. However, the crystal structure investigations of the lambda repressor
DNA-binding-domain-DNA and engrailed homeodomain-DNA complexes, as well as ZF protein-DNA
binding assays, demonstrate that thymine can make both hydrophobic contacts with non-polar
residues (alanine, leucine, isoleucine, valine) and H-bonds with polar AA's (lysine, arginine,
glutamine) [8,11,14,17,20].

The X-ray crystallographic studies of lambda and phage 434 repressor DNA-binding
domain complexes with corresponding operator sites revealed that an adenine base forms two
H-bonds to glutamine: 1) the amide NH2-group of the glutamine side chain donates an H-bond
to the N7-atom of adenine and 2) the amide O-atom accepts an H-bond from the N6 atom
[17,18]. Similar H-bonds have been found between adenine and asparagine residues in the
two homeodomain complexes [20,21]. ZF protein-DNA binding assays also indicate that, in
ZF contact positions, adenine makes strong interactions with both glutamine and asparagine
[8,11,12,14]. Considering that glutamic and aspartic acid carboxylic groups have O-atoms
capable of accepting H-bonds, as do glutamine and asparagine amide O-atoms, one may
suppose that adenine can form a single H-bond with both glutamic and aspartic acid. Indeed,
Letovsky and Dynan [19] have shown in a directed mutagenesis investigation that
transcription factor Sp1, containing a glutamic acid residue in the central contact position of
the ZF, binds only 3-fold more weakly to the adenine-substituted variant (-GAG-) than to the
wild-type consensus recognition site (-GCG-). In addition, Desjarlais and Berg [14] and Berg [8]
think it probable that adenine can (like guanine in the Zif268-DNA complex) make one
H-bond to a histidine residue. It is likely that not only histidine but also other polar amino acids
(arginine, lysine, tyrosine, serine and threonine) are capable of forming an H-bond to atom N7
of adenine.

A database of potential ZF protein domains, containing 1,851 entries, has been
assembled. This database was used computationally to observe the homologies between the
ZF domains.

Several years ago Seeman et al. [22] concluded that a single H-bond is inadequate for 
uniquely identifying any particular base pair, as this leads to numerous degeneracies. They 
proposed that fidelity of recognition may be achieved using two H-bonds, as occurs in the 
major groove when asparagine or glutamine binds to adenine, and arginine binds to guanine. 

On the basis of the above-given results, it was reasonable to test, using the models
described herein, base recognition at the ZF contact positions of the following AA's:

1) guanine - R, H, K, Y, Q, N, S, T;

2) cytosine - E, D, Q, N, S, T;

3) thymine - I, L, V, A, R, H, K, Y, Q, N, S, T;

4) adenine - Q, N, E, D, H, R, K, Y, S, T.
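
The four candidate lists above lend themselves to a simple lookup table. The following Python sketch, in which the names CANDIDATES and bases_for are hypothetical, records the lists as given (reading the first thymine entry as isoleucine, I) and shows two queries that follow directly from them.

```python
# Candidate amino acids tested against each base, keyed by base letter.
# Note: the key "A" is the base adenine; "A" inside the thymine set is alanine.
CANDIDATES = {
    "G": set("RHKYQNST"),
    "C": set("EDQNST"),
    "T": set("ILVARHKYQNST"),
    "A": set("QNEDHRKYST"),
}


def bases_for(residue: str) -> list[str]:
    """Return the bases against which a given residue was a test candidate."""
    return sorted(base for base, aas in CANDIDATES.items() if residue in aas)


# Residues that appear in all four candidate lists.
shared = sorted(set.intersection(*CANDIDATES.values()))

print(shared)
print(bases_for("E"))
```

Only glutamine, asparagine, serine and threonine appear in every list, whereas a residue such as glutamic acid is a candidate for cytosine and adenine only.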

Plastic space-filling atomic-molecular and ionic models [23,24] have been used to build
ZF-DNA complex imitations. These molecular models were chosen due to the extraordinary
firmness of their connectors, their convenient scale (1 cm = 1 Å = 0.1 nm) and their improved
theoretical parameters which were very suitable for the modeling of macromolecules. New
modules of tetrahedral carbon atoms, with bond angles 100° and 105°, dihedral oxygen atoms
(120°) and tetrahedral phosphorus atoms (102° and 118°), maintained the exact modeling of
deoxyribose puckering and sugar-phosphate chain conformation in the B-DNA model.
Peptide bonds in the DBP models were imitated by the fixing, to each other, of special
modules of carbon atoms (bond angles 116°, 120.5° and 123.5°) and nitrogen atoms (122° and
119°). The zinc ion was represented in the model by a sphere (R = 0.85 cm) fixed
tetrahedrally to N and S atom modules of ZF histidine and cysteine residues. A long
horizontal 34-base B-form DNA model with laterally-fixed DBP models was used for docking
experiments.

In the first stage of the subject investigation, the models of Zif268 fingers 1, 2 and 3
were assembled, and the general spatial orientation of the ZF-B-DNA complex was observed.
In the second stage, the steric fitness of all 64 nucleotide triplets to the different combinations
of the above-mentioned AA's in the critical positions of the ZF-DNA complex was modeled.

A plastic molecular model of the Zif268 peptide-DNA complex was assembled on the
basis of crystallographic data [7]. After imitating the ZF-DNA backbone contacts and the
H-bonds between AA's and bases in the major groove, it was confirmed that the overall
arrangement of Zif268 is antiparallel to the DNA strand. The steadiest nonspecific ZF-DNA
interaction seems to be the H-bond between a phosphodiester oxygen atom and
the first invariant histidine residue fixed to the Zn2+ ion. A conserved arginine on the second β
strand also contacts phosphodiester oxygen atoms on the primary DNA strand. However,
fingers 2 and 3 of Zif268 contact equivalent phosphates with respect to the 3-bp sub-sites,
whereas the finger-1 H-bond is shifted by one nucleotide. Another four ZF-DNA backbone
contacts made by arginine and serine residues are even more irregular in relation to the ZF
modular structure.

All 11 critical H-bonds found in the Zif268-DNA crystal complex have been observed
in the plastic models. As expected, the threonine residue in the first contact position of the
second finger was too far from thymine to make an H-bond. However, differing from the
results of crystal structure analysis, the model investigation clearly indicated the possibility of
hydrogen bonding between a glutamic acid residue and cytosine in the second contact position
of fingers 1 and 3.

It is noteworthy that, of the six guanine-AA contacts in recognition positions observed
in the Zif268-DNA crystal structure, five were made with arginine and only one with histidine.
It is even more interesting that this histidine-guanine interaction was the only one in the
central-specific position. Considering the smaller size of histidine in comparison with arginine,
it may be supposed that the middle position has steric constraints prohibiting contact between
guanine and the larger arginine residue, although, due to its capability of forming two
H-bonds, the latter pairing should be energetically favored.

To investigate the spatial conditions in different recognition positions, a B-DNA model
was built which contained, in the primary strand, 1) the triplet GGG, and 2) models of ZF
α-helical protein fragments (including the AA immediately preceding the α-helix) with a) side
groups of the first Zn-binding histidine and b) groups for critical AA triplets R1R2R3 and
R1H2R3. The models of α-helical fragments were fixed to the B-DNA model by an imitation
of an H-bond joining a phosphodiester oxygen atom with a histidine residue. Specific
base-AA contacts were then tested in these complexes. It was elucidated that only the complex
GGG-R1H2R3 contains the contact groups in positions corresponding to the distances of
critical H-bonds found in the Zif268-DNA crystal structure. The complex GGG-R1R2R3 is
sterically unfavorable; molecular modeling reveals that, although in the outer contact positions
guanine and arginine can be joined by two H-bonds, in the middle position such a pair cannot
be included due to the limited space.

Observations derived from the physical models confirmed the supposition of steric
constraints for some AA-base contacts in the central contact position. In the case of the
complex G1G2G3-R1H2R3, the following approximate distances from guanine N7 and O6
atoms to the Cα atoms of corresponding AA's have been determined: G1N7-R1 = 7 Å,
G1O6-R1 = 8 Å, G2N7-H2 = 5.5 Å, G2O6-H2 = 6.5 Å, G3N7-R3 = 8 Å and G3O6-R3 = 7 Å.
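
As a small worked check of these figures, the Python sketch below averages the two reported distances for each contact position. The dictionary layout and variable names are hypothetical; only the six distance values come from the text.

```python
# Approximate distances (in Å) from guanine N7 and O6 atoms to the Cα atom of
# the contacting residue in the G1G2G3-R1H2R3 complex, keyed by contact position.
DISTANCES = {
    1: {"N7": 7.0, "O6": 8.0},  # G1 to R1
    2: {"N7": 5.5, "O6": 6.5},  # G2 to H2
    3: {"N7": 8.0, "O6": 7.0},  # G3 to R3
}

# Mean of the N7 and O6 distances for each contact position.
mean_by_position = {
    position: sum(atoms.values()) / len(atoms)
    for position, atoms in DISTANCES.items()
}

# How much shorter the middle position is than the average of the outer two.
outer_mean = (mean_by_position[1] + mean_by_position[3]) / 2
shortfall = outer_mean - mean_by_position[2]

print(mean_by_position)
print(shortfall)
```

On these values the mean distance is 7.5 Å in the first and third positions and 6.0 Å in the middle position, a difference of 1.5 Å.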

Using the models, the investigation of B-DNA and α-helix basic structure elucidated
the molecular basis for steric constraints in the second ZF-DNA recognition position.
Joining, by a straight line, the analogous atomic groups (for example, N7 atoms of guanine) of
the first and third base in the DNA triplet in the major groove results in the corresponding
group of the middle (second) base being distanced from this line by about 1.5 Å. Similarly,
joining the Cα atoms of the AA's in the first and third contact positions of the ZF by such a line
results in the Cα atom in the middle position also being at a distance of about 1.5 Å. Thus, the
space allowed for a critical AA in the middle contact position is compressed from both sides
approximately 1.5 Å.

Analysis of the above-given data on the ZF-DNA backbone contacts, as well as
observations derived from the models, led to the conclusion that there are considerable
differences in spatial conditions between first and third ZF-DNA recognition positions. In
the first position the Cα atom of the AA is distanced about 6.5 Å from the phosphodiester
oxygen atom where the ZF protein is fixed to the DNA backbone by the invariant histidine
residue. Due to the steady fixing of this ZF α-helical part by histidine, the freedom of
conformational rearrangements in the first contact position is limited: the Cα atom, with
corresponding side chain, can be moved 2-3 Å "up and down" in the plane of the base where it
is localized in the primary DNA strand or, alternatively, 1-2 Å perpendicularly to this plane.

On the other hand, the fixing of the N-terminal end of the ZF α-helical region to the
DNA backbone seems to be rather loose and variable, therefore allowing relatively large
rearrangements for the Cα atom and the corresponding AA in the third contact position. The
latter contact position is favored by the fact that the Cα atom in this position is more distant
from the main fixation place (about 10.5 Å from the phosphodiester atom bound to the
histidine residue), and the corresponding AA in this position is not a part of the α-helix. The
most important finding is that, due to the above-described circumstances, the critical AA in the
third contact site can apparently occupy very different positions in the corresponding bp plane.
This means this residue may, in certain complexes, be very close to the base of the
complementary DNA strand. One of the reasons for the appearance of such a geometrical
configuration is that the typical, right-handed helical twist of B-DNA makes the
complementary base on the nucleic acid second strand in the third contact site even more
accessible than the main base on the primary chain. Molecular modeling clearly shows that in
the third, and also partially in the second contact position, this DNA strand is capable of
participating in the ZF-nucleic acid recognition process. In the Zif268-DNA crystal complex,
the α-helix of each ZF domain, which is bound only to the DNA primary strand, is tipped at
about a 45° angle with respect to the plane of the base pairs [7]. In cases wherein the second
DNA strand, via critical H-bonds involving the third and second contact positions, is involved
in the reading process, the direction of the α-helix axis should be even more perpendicular to
the base pair plane.

Thus, this more detailed investigation of ZF-DNA-complex imitations, through use of
physical molecular models, shows that steric conditions in each of the three contact regions
are different. These steric conditions are reflected in the ZF-DNA recognition rules.

On the basis of information obtained above, which yielded a general observation of
steric conditions in the ZF-DNA recognition process, an extensive model study of various
AA-base combinations in the critical contact positions was undertaken. The results of this
investigation are presented both as the ZF-DNA reading code and main rules for recognition
(Tables 1, 2 and 3). The rules are in good accordance with crystallographic, directed
mutagenesis, DNA-binding and sequence analysis data.

With reference to the sequence of Formula I and the 2-dimensional structure diagram 
in Figure 2 (which provides a schematic representation of a zinc-finger domain and its 
interaction with a DNA strand), the studies confirmed the identity of the three critical contact 
positions in a given zinc-finger domain as follows: 

1) between the first nucleotide in the triplet and the first AA preceding the
constant histidine at the COOH end of the a-helix; 

2) between the second nucleotide in the triplet and the fourth AA preceding the 
constant histidine at the COOH end of the a-helix; and, 

3) between the third nucleotide in the triplet and the seventh AA preceding the 
constant histidine at the COOH end of the a-helix. 
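
The three positions listed above can be read off a domain sequence by counting back from the constant histidine. The Python sketch below illustrates this; the function name contact_residues and the example fragment FSRSDELTRH are assumptions introduced here for illustration (the fragment is modeled on the first finger of Zif268 and is not taken from this text).

```python
def contact_residues(domain: str, his_index: int) -> dict[str, str]:
    """Return the residues at the three critical contact positions.

    Z1, Z2 and Z3 are the 1st, 4th and 7th residues preceding the constant
    histidine at the COOH end of the α-helix; his_index is its 0-based index.
    """
    if not 7 <= his_index < len(domain):
        raise ValueError("his_index must leave at least 7 preceding residues")
    if domain[his_index] != "H":
        raise ValueError("residue at his_index is not histidine")
    return {
        "Z1": domain[his_index - 1],
        "Z2": domain[his_index - 4],
        "Z3": domain[his_index - 7],
    }


# Illustrative fragment (an assumption, see above); its final H is at index 9.
fragment = "FSRSDELTRH"
print(contact_residues(fragment, fragment.rindex("H")))
```

For this fragment the sketch gives arginine at Z1 and Z3 and glutamic acid at Z2, in line with the statements above that arginine occupies the outer positions and glutamic acid the middle position in fingers 1 and 3 of Zif268.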

Steric conditions in the three contact sites of the ZF-DNA recognition complexes are
different. The first contact position is relatively large and strictly fixed, which enables the
binding of a longer AA to bases on the primary DNA strand with sufficient specificity and
affinity. The second position is compressed and can accommodate smaller AA's with
somewhat lower specificity and affinity. The third position allows considerable conformational
rearrangements including the contacts with the complementary DNA strand.

In Table 1, for each nucleotide of a given DNA triplet on the primary strand, both main
(Column A) and alternative (Column B) base-binding AA's are presented. Both specificity
and affinity were considered in including a residue in Column A. As was proposed already by
Seeman et al. [22], the fidelity of recognition is better maintained in the case of purine bases
(guanine and adenine), because they occupy a greater portion of the major groove and offer
more hydrogen bonding sites than the pyrimidines. Therefore, the strongest AA interactions
appeared to be those of arginine, glutamine and asparagine, each binding by two H-bonds to
either guanine or adenine. The affinities of aspartic acid, glutamic acid, asparagine and
glutamine were frequently enhanced by the formation of water bridges between carboxylate or
amide oxygen atoms and DNA backbone phosphodiester oxygen atoms. Although van der
Waals interactions are relatively weak, they can play a certain role in recognition of the
thymine methyl group by hydrophobic AA's (alanine, valine, leucine and isoleucine).

As indicated in Table 1, in many ZF-DNA complexes the base recognition in the
nucleotide triplet of the primary DNA strand occurs not entirely via the primary strand, but by
binding simultaneously to both the primary and complementary strands, or even exclusively to
the complementary strand. Without "help" from the complementary DNA strand, the binding
of critical AA's to nucleotides of the primary DNA chain would be too weak, in the case of
several triplets, to realize the recognition process. All possible AA replacements were tested
for strength of interaction in the Z1-Z3 positions. Domains with fewer than 2 hydrogen bonds
on the primary strand were considered to be unstable.
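
The stability cutoff stated above, applied across the 64 possible triplets, can be sketched as follows. This is a minimal illustration only: the names ALL_TRIPLETS, is_stable and hypothetical_counts are assumptions, and the counts shown are placeholders rather than values from Tables 1 to 3.

```python
from itertools import product

# All 64 possible DNA triplets on the primary strand.
ALL_TRIPLETS = ["".join(bases) for bases in product("ACGT", repeat=3)]

MIN_PRIMARY_H_BONDS = 2  # cutoff stated in the text


def is_stable(primary_strand_h_bonds: int) -> bool:
    """Domains with fewer than 2 primary-strand H-bonds count as unstable."""
    return primary_strand_h_bonds >= MIN_PRIMARY_H_BONDS


# Hypothetical placeholder counts, NOT values from the tables.
hypothetical_counts = {"domain_a": 4, "domain_b": 1, "domain_c": 2}
stable = [name for name, n in hypothetical_counts.items() if is_stable(n)]

print(len(ALL_TRIPLETS))
print(stable)
```

A count of exactly 2 passes the cutoff, since only domains with fewer than 2 hydrogen bonds on the primary strand are excluded.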

Table 2 presents the ZF AA triplets having the highest affinity for interaction with 
corresponding DNA triplets. These ZF triplets contain only the main residues presented in 
Column A of Table 1. Table 2 also presents the binding energy components (H-bonds, water 
bridges, van der Waals interactions) maintaining the ZF-DNA recognition process in specific 
contact regions. 

As can be seen from Table 2, the participation of the complementary DNA strand in 
the process of ZF binding, combined with the number of interactions (H-bonds, water bridges 
and van der Waals interactions) possible in the three contact regions, when optimal 
combinations are used, makes it possible to show that a complex formation with all 64 DNA 
triplets can be achieved. Table 2 shows that the maximal number of H-bonds, the strongest of 
the three types of interactions, is obtained when the first nucleotide of the triplet is guanine or 
adenine. 

In nucleotide triplets wherein the number of H-bonds possible is less than maximal, the 
deficiency is often partially compensated by a significant amount of water-bridging between 
critical AA's and the sugar-phosphate backbone. 

Even in cases wherein the first nucleotide of the triplet is thymine, and the number of
the H-bonds is lowest, 1) the formation of two H-bonds between the AA in the Z3 position, 
and the adenine and complementary thymine in the third contact position, and 2) probably, a 
single H-bond between thymine and serine or threonine in the second contact position, means 
that even TTN triplets can bind a ZF protein with sufficient affinity. 

In any event, to obtain DBP's of the greatest effectiveness, attention should be paid to
having the strongest interactions in the flanking contact points (1 and 3). If weaker
combinations must be used, they would have less effect if positioned in the center contact
point (2). It is important to note, however, that even weak binding in the contact points is
important for establishing specificity.

Table 3 presents the main ZF AA triplets of Table 2, as well as the alternative AA's
(shown in Column B of Table 1) which would be also expected to provide effective binding to
the respective bases of a given DNA triplet. Table 3 also presents the binding energy
components (H-bonds, water bridges, van der Waals interactions) maintaining the ZF-DNA
recognition process in specific contact regions.

Table 1 - Z1



Codon Z1 

Column A

AAC Qs 
AAG 

AAT Qs 

ACC Qs 

ACT Qs 

QAA Rs 

QAC Rs 

GAG Rs 

OAT Rs 

OCT Rs 

OCT R- 

ACA Qs 

ACG Q« 

AGA Qs 

AGG Q= 

CAA E' 

GAG E* 

CAT E* 

CCC £• 

CCT E» 

OCA R= 

GOG R= 

QGA Rs 

AAA R./K- 

AOC Os 

ACT Q=: 

QCC Rs 

GOG Rs 

GOT R>= 

CAC 

CCA E* 

COG E- 

CGA £• 

OQC E* 
COG 

OCT E* 



ATA 
ATC 
ATG 
ATT 
GTA 
OIC 
OIG 
GTT 
TAA 
TAG 
TAT 
TCC 
TCT 



Qs 

Os 
R« 
Rs 

n» 

If/Lf 
l«/Lf 
f«/Lf/V« 
t#/Lff 



Z1

Column B

E VR , ,-/N , =/0 , V( H -/ Y./S-/T-) 

EVR^K- 

R-/K-/E' 

EVK,- 

E-/R,^K,- 

K./H,-/Y,. 

H-/K-nr-/o. 

H-/K./Y-/Q* 
H./K-/0-/N-/(Y-/S-/T-) 

H-/K-/Y-;o- 

R-/K-/N-/E-/D- 
R-/K-/N-/EVD* 
EVRWKf 
EVR,-/K,- 

QVN,VD,VR3./KWY,-/S,-/T,. 

QVNVDVR,-/K,- 

QVRj-/K,./(DVNVS./T-/Y,-) 

QVN,VD,VR,=/K,./Q,-/N,* 

QVRjVK,- 

K-/Q-/(H-/Y-/N-yS-/T-) 

H-/K-/Y./Q-/N-/S./T. 

H./K./QVW/Yt- 



R-/K-/E* 

EVRWK,- 

K./QV(H-nr./N-) 

H-/K./Y-/QVN* 

H./K-/Y-/Q-/N- 

QVRWK,- 
QVRWKj- 
QVNVDVR,3/K,. 
QVNVDVRWK,- 

QVNVOVR,-/K,-nr,-/Q,-y(S-/T-) 

QVNVOVRr-/K,./Q,e 

Q*/D,*/R»«/Kv-/Qt- 

E*/Rf/Ki- 

N-/EVOVRf-/K,- 

R-/K-/E-/(H*) 

EVR,-/Kt- 

H-/K-/Y*/0- 

H./K^-/Q» 

K-/Q-Alf 

H-/K-/Q-/N-nrt- 

R-/K-/Q-nr,i 

R-/K-/Q* 

R-/K-rir./Q^. 

R-/H-/K-/Q*/NVVi/A« 

R-/H-/K^Q* 



Hydrogen Water Hydrophobic 
Bonds Contacts Contacts 



6 
6 
6 
6 
6 
6 
.6 
6 
6 
6 
6 

5 
5 
5 
5 
5 
5 
5 
5 
5 
5 
5 
S 

5 
5 
5 
5 
5 
5 



0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 



0 
0 
0 
0 

2 
2 
2 
2 

1 
1 
1 

0 
0 
• 0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 



0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 
0 
0 
0 
0 
0 

0 
0 
0 
0 

0 
0 
0 
Table 1 - Z1 



CTA E* Q* 

CTC B* QVN,./Dt-/R,=/KWQf/N,- 

CTG E* QVNVDVR,./K,- 

CTT E- QVW/DVn,-/K,-/Q,« 

TCA lf/L# H-/K-/Q* 

TCG l#/Lff R-/K*/Q* 

TGA l#yL# R-/K-/Q* 

TAC U;L0/V« R-/K-/Q* 

TBC l,#/Lt#/V,# R-/K-/H,-/Q,VN,VS,-nr,. 

TOG li/L* R-/K-/Q* 

TGT U R-/H-/K-/QVNVL* 

7TA l#/L# R-/K-/QVN' 

TTC Iff/Lff/Vff R-/K*/QVN* 

TTG I#;L# R-/K-/Y-/Q* 

TTT I#/L# . R./K-/QVN»/V,« 



3 
3 
3 
3 
3 
3 
3 

3 
3 
3 
3 

2 
2 
2 
2 
Legend 

where / separates alternative amino acids 

where X without subscript has all its interactions with the primary strand 

where X1 has some interactions with the primary strand and some interactions with the 
complementary strand 

where X2 has interaction with the complementary strand 
where X3 has interactions with both the primary and complementary strands 
where - is one hydrogen bond between the amino acid and the base 
where = is two hydrogen bonds between the amino acid and the base 

where * is one hydrogen bond via a water bridge between the amino acid and the phosphodiester 
oxygen atom of the backbone 

where # is one or more van der Waals contacts between the amino acid and the base 

where amino acids in ( ) have interaction with the base of the primary strand where one of two 
other possible protein-DNA recognition interactions is absent 



Table 1 - Z2 



Codon Z2 



Z2 





Column A 


Column B 


AAC 




Na/OVS-/T-/n,-/K,-/E,V(H-/V-) 


AAG 


Qs/N= 


R-/H-/K./EVDVK,- 


AAT 




K-/Qa/E*/n*/B ./H .lit «./M . 


ACC 






ACT 






GAA 


Ns 




GAC 


Nb 


D VR 1 -/H • -/K* -/ Y . -/O . s/P . 


GAG 


Ns 


^ # 1 ' n f *f w^*/ 1 -f *# A 2* 


OAT 


Qs/Ns 




QCC 




Q*/N*/D*/S-/T./n- yH_ /If /ft /u /e rr 


GCT 


N3B/O* 


Q-/N-/E-/S./T-/W -/If ./M •/A .//D —/V I 


ACA 


0* 


OVNVEVS-/T-/K,- 


ACG 


EVD- 


QVNVS-/T-/R,-/H,-/K,-/Y,-/Q,=/N,= 


AGA 


W 


Rt-/H,./K,./Y,-/Q,* 


AGG 


QVN' 


••1 '••1 /•X'l" 


CAA 


N = 


n*/Q«/Tr«/D ^/u itf iv 

w f 9*/ 1 •/rii*/nt-/Kt*/Yi*/Qf b/Ei* 


GAG 


Q= 




CAT 


Qis/Ns 


n-/^*rr_/Q _/u tt^ /V /E 

u-/a-/ 1 -/Hf/Mt''Kf-/Yf/E|=/Kj*/Q,=/Nj= 


coc 


Qa=/Ei' 


n !U /5*/T-/H}-/Kf-/N}S/(Ya-) 


ccr 


N,=/D* 


QVNVEVS-/T./H,-/K,./Q,= 


GCA 


0* 


QVNVEVS-/T./R,./H,-/K,-/Y,-/Q?=/N,= 


GOG 


EVD* 


Q VN VS ./T-/K ,-/Q , -/N ,«/( H ) 


UUA 


N* 


QVS-/T-/K,-/(R./Y-/HO 


AAA 


N= 


DVR,JHt7Kt7Y,7Q,=/E|* 


AGC 


Hf 


Q VN VS -/T-/R , -/K , -/( R =/Y-) 


ACT 


Hi- 


QVNVS-/T-/R,=/Kt- 


GQC 


H,- 


N VS-/T./Kt./( R-/K-/Y./Q*) 


GGG 


H- 




GGT 


H. 


QVNVS-/T-/R1./K,- 


CAC 


Ns 


DVR,-/K,-/Q,-/E/ 


OCA 


D* 


NVS-/T-/Q,VE,VK,-/N,« 


GOG 


D* 


QVNVEVS-/T./K,-/Q,a/N,a/(R./H-/Y-) 


OQl 


W 


QVS-/T./R,-/HWK,-/(Y.) 


OGC 


Hi, 


QVNVS-/T-/K,-/(R,Bnr.) 


OQG 


H- 


K-/QVNVR,-/(Y-) 


OGT 


u 
n- 


aVNVR,-/K,-/{Y-) 


ATA 


lf/Lt/Vi/A# 


S-/T-/Kt-/a,VN,V(R-/H^-) 


ATC 


lt/L« 


NVS-/T-/V./R,-/H,-/K,-/Yt-/Q//R,./K,./Qt=:/N,«/E,VD,- 


ATG 


lt/L«/Vf 


S-/T./E,./0,./Q,«/N,-/(R-/H-/K-/Y.) 


ATT 


l«/Lfl/V« 


R-/H-/K-/Y*/Q*/NVS^- 


OTA 


lf/Lf/V</Af 


QVNVS./T-/R,./Hf/K,./E,VD,VQ,«/N,.=/(Y.) 


xnc 


lt/L« 


N-/S-/T-/V./RWHi-/K,-/Q,* 


cnG 


lt/Lt/V« 


QVNVS./T./Ht-/K,- 


□TT 


It/Lf/Vf 


QVN-/S-/T-/Ki./(R-/H./Y^A-) 


TAA 


Nor 


DVRi./H,-/K«-/Y,-/Q,«/E,. 



Hydrogen Water Hydrophobic 
Bonds Contacts Contacts 



6 
6 
6 
6 
6 
6 
6 
6 
6 

5 
5 
5 
5 
5 
5 
.5 
5 
5 
5 
5 
5 

5 
5 
5 
S 
5 
5 



0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 



0 
0 
0 
0 
0 
0 

2 
2 
2 
2 

1 
1 

1 

0 
0 
0 
0 
0 
0 
0 
0 
0 



0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 
0 
0 
0 
0 
0 

0 
0 
0 
0 

0 
0 
0 
Table 1 - Z2 



TAG N= Q=/EVDVRf/Kt-/K,- 

TAT N= K-/Q=/N=/EVDVS-n'-/H,-/H,-/K,./Q,s/N,=/(R-/Y-) 

TCC Q,a/E* QVNVDVS-/T-/R,-/H,-/K,-nr,-/Q,-/N,- 

TCT N,a/D' Q*/N"/EVS-/T-/R,n/Hj-/K,./YWQ/ 



CTA i#/L«/Vi/Atf S./T./Q,c:/N,W(H-/K-) 

CTC »#/Li S-/T-/V./H,-/H,-/Ki-/Y,./Q,VN,* 

CrC lf/L#/V# NVS-/T./K,-/Qt* 

CTT If/Li QVNVS-/T-/V«/R,./H,-/Ki./EWDWQ»=/N,=/(Y.) 

TCA D ♦ Q VNVEVS-/T-/R,«/H ,-/K,-/ Y,-/Q j«/N ,= 

TCG QVNVEVS-/T-/R,-/H,-/K,-/Ya-/S,-/T,-/Q,o/N,« 

TCA N* QVS-/T-/H,-/K,-/(R-/Y-) 



TAG 
TCC 
7GG 
TGT 



Hr 
H- 

H,- 



D-/H,-/K,-/Q,=yE,VK,-/(R-/Y-) 
QVNVS-rr./R,=/K,-/Yr 
R./K-/QVMVY,- 
NVS-/T-/Kt-/Y,-/Qr/(R = ) 



TTA l#/L#/V#/A«> S-/T-/R,-/Hi./K,-/Y,-/E,-/D2-/Q,«/N,= 

TTC I#;L# N'/S-/T-/V-/A-/R,-/Hi-/K,-/Y,-/K,./0,=/N,= 

TTG l#/L#/V# NVS-/T-/H,-/K,-/Q/ 

TTT l«/L«/V« Q'/NVS-/T-/A-/H,-/Ki-/(R-/Y-) 






WO 99/42474 



PCT/US99/03692 



Legend 

where / separates alternative amino acids 

where X without subscript has all its interactions with the primary strand 

where X1 has some interactions with the primary strand and some interactions with the 
complementary strand 

where X2 has interaction with the complementary strand 
where X3 has interactions with both the primary and complementary strands 
where - is one hydrogen bond between the amino acid and the base 
where = is two hydrogen bonds between the amino acid and the base 

where * is one hydrogen bond via a water bridge between the amino acid and the phosphodiester 
oxygen atom of the backbone 

where # is one or more van der Waals contacts between the amino acid and the base 

where amino acids in ( ) have interaction with the base of the primary strand where one of two 
other possible protein-DNA recognition interactions is absent 
Table 1 - Z3 

Codon Z3 Z3 

Column A Column B 

Hydrogen Water Hydrophobic 
Bonds Contacts Contacts 



AAC ft,. 

AAG Rs 
AAT 
ACC 
ACT 

GAA 0« 

GAC OWE* 

GAG R= 

OAT Q}./Q,. 

GOC R», 

OCT Oss 

ACA 0= 



ACC n. 

AGA Qs 
AGG Ftfi 



CAA 
CAG 
CAT 
OCC 
OCT 
OCA 
OCC 



AAA 0= 
AOC 

ACT 0,1. 
GOC 

OQC Rs 

OCT Q^N^d 

CAC E* 

OCA Qs 

COG Rs 

CGA Os 

OQC 

OOQ Rs 

CGT O^N,. 

ATA 0« 

m R, 
ATT 

OIA Qc 

ore R,«/E' 

OTG Rs 

OTT Q4. 
TAA 

TAG R. 

TAT Qt. 

TDC R^ 

TCT Q^Nj. 

CTA Q, 

CIC Rfv/E* 



Y./0*/N'/EV0VMr7IC,-/Y,-/Q»VN,' 
H-/K-/Y-/Q-/Q,-/£,-/NaVD,VS,*n-j- 
R./H-/K-/Y-yQ«/R,yK,yO,^N,^E,VO,*/0,s 
QVEVK,-/N,VS,-/T,- 

R-/H-/K-/Y./Q*/R^/H^/IC,-/Y,-/N,«/E,»/D,' 

R-/H-/K-/EVR,./Hr/K,-/Yr/Q.' 

QVH,«/Hr/KrVY,-/Qi*/N,VSrAr 

K./QVO,*/E,VNWD,-/S,-rr,. 

R-/H-/K-/Y-/Q-/NVI-/R,./HWICr/Y,-/N,»/E,*/0,- 

QVNVE-y0VHW»C,^r/Q»"/N,VSr/TWQt«/N,« 

H-/K./Y-/Q/R,./Ht-/K,VY,-/N,VE,VD,V&r^,- 

R^H-yK./Y-/N=/E»/DVIi/Lf/Vfffl,./H,-/K,-/Y2./Q,»yM,V<Sr/T,-) 
H-/K-yY-/QVNVS-n'-/E,VD,VQ,cyN.= 

E-/R,.yH,-yK,-yY,-/QWN/yi,#yL,i/v,t/A,t 

K./Y.yQ,VN,*yE,'/D,- 

R-/H./K-/Y-/QayEVR,./H,-/K,-yY,-yQ,* 

H./K./Y.yQ-/N-/Q,VN,VE,VO,VS,-yT,. 

R-/H-/K-/Y./0*yD,VE,- 

Q*/EVH,.yK,-/Yy./0,VN,VSK/Tr 

n-yH.yK.yY-/Q-yHWKr-yY,-/N,VEWSf-n'r 
R-yH.yK.yY.yEVR,./H,./K,-/Y,.yQ,VN,vs,-rTj-/i,f/L,fyv,fyA,# 

H-/K-yYVQ-yE,-/Or/Sr^WQ,=yH,» 
N«yEVOVR,-/H,-/K,-/Yr/Qf*/N,'yi,#yL,iyv,i/A,i 

R-/K-yN./E-yD-/Rr/Hr/Kr/Y,-/Q,.>Nr/lit/L,# 

QVNVEVDVH,-yK,.yYWQ7VW,* 

R-/K-yY-/QVMVS^./liyL«yHWK,-/Y,-/N»=/E2-/0,« 

QVN*/EVOVH,-yK,-nr,-/Q,-yN,VS,-/T,. 

H./K.yY-yOVNVE,V0,'/S,./T,-/Q,=/N,« 

R-/K.yQVNVK,-/E,VO,'yS,^,. 

QVR,«/H,-/K,-/Y,-/Q,* 

EVR,-/H,.yK,-/Y,./Q,VN,*yS,-/T,-yi,t/L,i/V,iyA,« 
H-yK.yQ-/N-yE,VD,VOWN,= 

R-/H.yK.yY-yN=yEVDVR,.yHWK,-yY»-/QtV{N,vs,.n-,-yi,iyL,*yv,#yA,i) 

QVNVEVDVH,VKr-/YWOr/Nt- 

HVK./Y.;Q»/E,-/0,VS,./Tr/Qi«/M,. 

n-/IC-/Y.yQ-yN*/RWHWKWYr/E,'/D,VQ,. 

R-/KVE*/nWHr/K«VYrAt«iL,f/V,i/0«» 

Q*/Rt.nir/K«-nrrfO,*/N,* 

H'/K-rr<^Q^O>Vfl,*/E|-/0,* 

R-/HVK-/YWQ-/N«/nWH,VKr.yYWQ»«/N»«/E,*/D,VSWT,- 
R-/K^^NWEVIWRWHWK,-nfWOr/Nr/S^/TrVI,i/L,l/V,l/A,i 

ovK,-/K,.nrwQ,*/N,»/s,-n-^ 

H-/K^VQ*/Q,-/H,-/E,VO,*/Sr/T,- 

R^^-nf-/Q-/N»nf/L»/V#/RWHWICrA'WQ»=/N,-/Ern>,^ 

n-/H-«-nf./E*/HWHWICrn'WOi*iIt«/«-t«'V,»/A,t/Q,m 

HVlC-/YVQVNVO,*/NrVE,*/Dr 

R^MVKVY-yQ*/NVSVT-nt/Lt/V#/RWHr/Kr/Y,-/Q,«/HWE,VO,-/SWT,- 
Q*/E»/HrnCrrrr'Qt-/Hr/Sr/T,- 

IWHVK-nrVQVftrmWK,VYWEWDf 
R-/K-/E-mWKWQiVM//S,./TWI»«/Lfi/V,f/A^# 



6 
6 
6 
6 
6 
6 
6 
6 
6 
6 
6 

5 
5 
5 
5 
S 
S 
5 
5 
5 
5 
5 
5 

5 
S 
5 
S 
5 
5 



0 
0 
0 
0 
0 
0 

2 

2 
2 
2 

1 
1 
1 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

1 
1 



0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 
0 
0 
0 
0 
0 

0 

0 
Table 1 - Z3 



CTG Rs K-/Y-/Qf/Ni*/EWDt* 

CTT R-/M*/K-/QVNVRr/H,-/K,-/Y,-/N^/E//0,' 

TCA Os Y-/E'/RWHr-/K«-/Y»-/Ny*/Sr/Ts-/l3«/Lf«/V,t/A«f/0,3/N,8 

TCC «= H-/K-/Y-/0'/Q,VNr-/E,*/0,-/S,-/Tr 

TCA 0= R-/H*/K-/Y*/Nn/EVD*/R,-/H,-/K,-/Y,./Of-/0|e 



TAC E- Q-/R,a/Hr/KWVWQi*/N,* 

TGC Ba- Q'/N*/EVDVHWK,-/Y,-/QtVNr 

7GG Rs H-/K-/Y-/Q-/NWEWDt-/&WTr 

TGT Qfc nVH-/K./Y-iQVIf/Li/flWHr/Ki-nrWMK^E,* 



TTA Oa R^H-/K-/Y./EVR,-/HWKr/Vt-/N,-/S,-/Tr/lt«/I-,i/V,f/A,f 

TTC RWE* Q'/NVD*/MWK,-/Y,.;Or/MWS,./T,- 

TTG B= H-/K-/Y-/0-/0,VN,VE»VO," 

TTT Q,» R-/H-/K-/Y/R,-/Hr-/Ki-/YWQi='N,=/Er/D,-/N,= 
Legend 

where / separates alternative amino acids 

where X without subscript has all its interactions with the primary strand 

where X1 has some interactions with the primary strand and some interactions with the 
complementary strand 

where X2 has interaction with the complementary strand 
where X3 has interactions with both the primary and complementary strands 
where - is one hydrogen bond between the amino acid and the base 
where = is two hydrogen bonds between the amino acid and the base 

where * is one hydrogen bond via a water bridge between the amino acid and the phosphodiester 
oxygen atom of the backbone 

where # is one or more van der Waals contacts between the amino acid and the base 

where amino acids in ( ) have interaction with the base of the primary strand where one of two 
other possible protein-DNA recognition interactions is absent 
Table 2 

Codon  Z1        Z2        Z3        Hydrogen  Water     Hydrophobic 
       Column A  Column A  Column A  Bonds     Contacts  Contacts 

AAC    Q         Q         R         6         0         0 
AAG    Q         N/Q       R         6         0         0 
AAT    Q         N         Q         6         0         0 
ACC    Q         E/Q       R         6         0         0 
ACT    Q         D/N       Q         6         0         0 
GAA    R         N         Q         6         0         0 
GAC    R         N         E/Q       6         0         0 
GAG    R         N         R         6         0         0 
GAT    R         N/Q       Q         6         0         0 
GCC    R         E/Q       R         6         0         0 
GCT    R         D/N       Q         6         0         0 
ACA    Q         D         Q         5         1         0 
ACG    Q         D/E       R         5         1         0 
AGA    Q         N         Q         5         1         0 
AGG    Q         N/Q       R         5         1         0 
CAA    E         N         Q         5         1         0 
CAG    E         Q         R         5         1         0 
CAT    E         N/Q       N/Q       5         1         0 
CCC    E         E/Q       R         5         1         0 
CCT    E         D/N       Q         5         1         0 
GCA    R         D         Q         5         1         0 
GCG    R         D/E       R         5         1         0 
GGA    R         N         Q         5         1         0 
AAA    K/R       N         Q         5         0         0 
AGC    Q         H         R         5         0         0 
AGT    Q         H         Q         5         0         0 
GGC    R         H         R         5         0         0 
GGG    R         H         R         5         0         0 
GGT    R         H         N/Q       5         0         0 
CAC    E         N         E         4         2         0 
CCA    E         D         Q         4         2         0 
CCG    E         D         R         4         2         0 
CGA    E         N         Q         4         2         0 
CGC    E         H         R         4         1         0 
CGG    E         H         R         4         1         0 
CGT    E         H         N/Q       4         1         0 
ATA    Q         A/I/L/V   Q         4         0         1 
ATC    Q         I/L       E/R       4         0         1 
ATG    Q         I/L/V     R         4         0 
ATT    Q         I/L/V     Q         4         0 
GTA    R         A/I/L/V   Q         4         0 
GTC    R         I/L       E/R       4         0 
GTG    R         I/L/V     R         4         0 
GTT    R         I/L/V     Q         4         0 
TAA    I/L       N         Q         4         0 
TAG    I/L       N         R         4         0 
TAT    I/L/V     N         Q         4         0 
TCC    I/L       E/Q       R         4         0 
TCT    I/L       D/N       N/Q       4         0 
CTA    E         A/I/L/V   Q         3 
CTC    E         I/L       E/R       3 
CTG    E         I/L/V     R         3 
CTT    E         I/L       Q         3 
TCA    I/L       D         Q         3 
TCG    I/L       D         R         3 
TGA    I/L       N         Q         3 
TAC    I/L/V     N         E         3         0 
TGC    I/L/V     H         R         3         0 
TGG    I/L       H         R         3         0 
TGT    I         H         Q         3         0 
TTA    I/L       A/I/L/V   Q         2         0 
TTC    I/L/V     I/L       E/R       2         0 
TTG    I/L       I/L/V     R         2         0 
TTT    I/L       I/L/V     Q         2         0 

where / separates alternative amino acids 
The results of the molecular modeling analysis of various ZF α-helix complexes with 
the 64 different DNA triplets (Tables 1, 2 and 3), and the findings of spatial peculiarities in the 
three contact positions, are reflected in the ZF-DNA recognition rules. On the basis of the 
rules set forth in Tables 1, 2 and 3, DBP's with optimal binding affinity for any target DNA 
sequence can be designed. The "Column A" designations, i.e., the "A Rules" in Tables 1-3, 
show the amino acids with optimal binding for a given codon (triplet). The "Column B" 
designations, i.e., the "B Rules" in Tables 1 and 3, show the amino acids with secondary, but 
still significant, binding affinity for a given triplet. 

The column A rules range from the strongest triplet recognition, with six H-bonds, zero 
water contacts and zero hydrophobic contacts and an evaluated energy of (5x6) + (2x0) + 
(1x0) = 30, to two hydrogen bonds, zero water contacts and two hydrophobic contacts, with an 
evaluated energy of (5x2) + (2x0) + (1x2) = 12. The column A rules ordinarily offer a choice 
of just one or two amino acids in positions Z1, Z2 and Z3. The column B rules, by 
comparison, offer from three possible amino acids in each of the Z1, Z2 and Z3 positions to as 
many as eighteen amino acids in different contacting arrangements in each of the Z1, Z2 and Z3 
positions. In the evaluation of the column B energies, there are a large number of different 
groupings of three amino acids in positions Z1, Z2 and Z3. The minimum energy is three 
hydrogen bonds, zero water contacts and zero hydrophobic contacts, with an evaluated energy 
of (5x3) + (2x0) + (1x0) = 15. The maximum energy evaluation for these combinations is, on 
average, three hydrogen bonds and either two water contacts or two hydrophobic contacts, 
with an evaluated energy of from (5x3) + (2x2) + (1x0) = 19 down to (5x3) + (2x0) + (1x2) = 
17. Thus, the column B rules have a narrower energy range (i.e., from 19 down to 15) than 
do the column A rules, which have an energy range from 30 down to 12. The narrow energy 
range for the column B rules means that the 64 column B rules do not distinguish triplets on 
the basis of energy as well as the 64 column A rules do. 
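
The weighting used in the evaluated energies above can be sketched in a few lines of Python. This is only an illustration of the arithmetic quoted in the text; the function name evaluated_energy and its parameter names are assumptions introduced here, not names used in the specification.

```python
def evaluated_energy(h_bonds: int, water_contacts: int, hydrophobic_contacts: int) -> int:
    """Weighted contact score: 5 per hydrogen bond, 2 per water contact,
    1 per hydrophobic contact (weights as quoted in the text)."""
    return 5 * h_bonds + 2 * water_contacts + 1 * hydrophobic_contacts


# Extremes quoted for the column A rules.
column_a_range = (evaluated_energy(2, 0, 2), evaluated_energy(6, 0, 0))

# Extremes quoted for the column B rules.
column_b_range = (evaluated_energy(3, 0, 0), evaluated_energy(3, 2, 0))

print(column_a_range, column_b_range)
```

The spread of the column A scores is wider than that of the column B scores, which is the point the preceding paragraph makes about how well each rule set discriminates by energy.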

For example, as set forth in Table 2, a DBP which binds optimally to the DNA base 
triplet guanine-cytosine-cytosine (GCC) is one wherein the portion of the protein responsible 
for the binding to the triplet is a ZF domain within which is contained a segment having the 
sequence Z3XXZ2LXZ1H, wherein Z1 is an arginine which interacts with position 1 of the 
DNA triplet; Z2 is a glutamine or a glutamic acid which interacts with position 2 of the DNA 
triplet; Z3 is an arginine which interacts with position 3 of the DNA triplet; X is an arbitrary 
amino acid; L is leucine and H is histidine. 
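
The GCC example can be sketched as a table lookup followed by substitution into the segment Z3XXZ2LXZ1H. The dictionary below is a hypothetical three-row excerpt of the column A entries of Table 2, and the names TABLE2_EXCERPT and contact_segments are assumptions made for illustration; X is kept as a literal placeholder for an arbitrary amino acid.

```python
from itertools import product

# Hypothetical excerpt of Table 2: triplet -> (Z1, Z2, Z3) column A choices,
# with "/" separating alternative amino acids as in the table legend.
TABLE2_EXCERPT = {
    "GCC": ("R", "E/Q", "R"),
    "AAC": ("Q", "Q", "R"),
    "GAG": ("R", "N", "R"),
}


def contact_segments(triplet: str) -> list[str]:
    """Return every Z3XXZ2LXZ1H segment allowed by the excerpt for one triplet."""
    z1, z2, z3 = (cell.split("/") for cell in TABLE2_EXCERPT[triplet.upper()])
    # Z3 comes first in the segment and Z1 last, mirroring the formula in the text.
    return [f"{c3}XX{c2}LX{c1}H" for c3, c2, c1 in product(z3, z2, z1)]


print(contact_segments("GCC"))
```

For GCC the sketch yields one segment per Z2 alternative (glutamic acid or glutamine), with arginine at both Z1 and Z3.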
As set forth in Table 1 or 3 (see the "column B" entries for the Z1, Z2, and Z3 positions 
for a given codon), a DBP which effectively, if not optimally, binds to the DNA base triplet 
guanine-cytosine-cytosine (GCC) is one wherein the portion of the protein responsible for the 
binding to the triplet is a ZF domain within which is contained a segment having the sequence 
Z3XXZ2LXZ1H, wherein Z1 is an amino acid selected from the group consisting of histidine, 
lysine, glutamine, asparagine, tyrosine, serine and threonine which interacts with position 1 of 
the DNA triplet; Z2 is an amino acid selected from the group consisting of glutamine, 
asparagine, aspartic acid, serine, threonine, arginine, histidine, and lysine which interacts with 
position 2 of the DNA triplet; Z3 is an amino acid selected from the group consisting of 
glutamine, asparagine, glutamic acid, aspartic acid, histidine, lysine, tyrosine, serine and 
threonine which interacts with position 3 of the DNA triplet; X is an arbitrary amino acid; L is 
leucine and H is histidine. 

It will be appreciated, of course, that DBP's of intermediate affinity, i.e., ones wherein 
the Z1, Z2 and Z3 contact amino acids are selected according to a combination of the "A" and 
"B Rules," can be designed. For example, in the segment Z3XXZ2LXZ1H within a ZF domain 
for binding to the triplet GCC, Z1 could be an arginine; Z2 could be a glutamine or a glutamic 
acid; and Z3 could be selected from the group consisting of glutamine, asparagine, glutamic 
acid, aspartic acid, histidine, lysine, tyrosine, serine and threonine. 

The basic building block for such proteins is denoted by the formula: 

NH2-ZiFc-COOH, 

where ZiFc is a ZF domain of the form 
Y/F X C X2-4 C G/D K/R X F X Z3 XX Z2 L X Z1 H X3-5 H, where 

Z1, Z2 and Z3 are amino acids chosen from Table 1, 2 or 3 to correspond to the three 
bases of the DNA triplet, and the remaining components of the formula are as described earlier 
in the description of Formula I. 
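
As an illustrative sketch only, the domain formula can be expressed as a regular expression so that a candidate sequence can be checked against it. The function name finger_pattern and the sample sequence are assumptions introduced here; the spacer lengths of 2 to 4 and 3 to 5 residues follow the X2-4 and X3-5 definitions given for the method, and each contact position may be passed either as a single letter or as a character class such as "[EQ]".

```python
import re


def finger_pattern(z1: str, z2: str, z3: str) -> re.Pattern:
    """Regex for Y/F X C X2-4 C G/D K/R X F X Z3 XX Z2 L X Z1 H X3-5 H.

    '.' stands for an arbitrary amino acid (X in the formula).
    """
    return re.compile(
        r"[YF].C.{2,4}C[GD][KR].F."   # Y/F X C X2-4 C G/D K/R X F X
        + f"{z3}..{z2}L.{z1}H"        # Z3 X X Z2 L X Z1 H
        + r".{3,5}H"                  # X3-5 H
    )


# Contact residues for the GCC example: Z1 = R, Z2 = E or Q, Z3 = R.
gcc_finger = finger_pattern("R", "[EQ]", "R")

# Made-up sequence with alanine at every X position, for demonstration only.
sample = "FACAACGKAFARAAELARHAAAH"
print(bool(gcc_finger.fullmatch(sample)))
```

Writing the formula this way keeps the fixed residues (C, F, L, H) and the three base-contacting positions explicit while leaving every X position unconstrained.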

In the preferred embodiment of the invention, a zinc-finger domain for binding to a 
given DNA triplet is designed by selection of the appropriate AA's in Table 2 or in column A 
of Table 1 or Table 3. In another embodiment of the invention, the ZF domain is designed by 
selection from among the AA's set forth for a given DNA triplet in column B of Table 1 or 3. 

One such domain is required for each triplet of the target sequence; for a target string 
of only 3 bases, the above formula defines the protein. 

If the target string of DNA is 6 bases, the DBP design is extended as follows: 

NH2-ZiF1-{linker}-ZiF2-COOH 

where ZiF1 and ZiF2 are ZF domains designed, as shown above for ZiFc, to bind to the first 
and second triplets of the six bases, and {linker} is an amino acid sequence conforming to the 
pattern 

T/S G/E X0-2 E K/R P, 

again wherein the components are as defined previously in Formula I. 

If 1) the target string of DNA contains 9, 12, or a higher multiple of 3 bases; 2) it is 
required to design a DBP for 3n+3 bases; and 3) the DBP for the first 3n bases is given by the 
sequence: 

NH2-ZiF1-{linker}-ZiF2-{linker}- ... -{linker}-ZiFn-COOH 

then the DBP design is extended recursively and the required DBP is specified by the 
sequence: 

NH2-ZiF1-{linker}-ZiF2-{linker}- ... 

... -{linker}-ZiFn-{linker}-ZiFn+1-COOH 

where ZiFn+1 is a ZF domain designed, as shown above for ZiFc, to bind with the (n+1)th 
triplet of the target sequence of base pairs. 
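
The recursive extension described above can be sketched as follows. This is a schematic illustration that only builds the symbolic NH2-...-COOH string; the function names fingers_needed and dbp_formula, and the ZiF1, ZiF2, ... labels, are assumptions made for the example.

```python
def fingers_needed(target_length: int) -> int:
    """Number of ZF domains for a target whose length is a multiple of 3."""
    if target_length < 3 or target_length % 3 != 0:
        raise ValueError("target length must be a positive multiple of 3")
    return target_length // 3


def dbp_formula(n_domains: int) -> str:
    """Schematic formula for a DBP with n_domains fingers, built recursively."""
    if n_domains < 1:
        raise ValueError("at least one ZF domain is required")
    if n_domains == 1:
        body = "ZiF1"
    else:
        # Take the design for the first n-1 triplets, strip its termini,
        # and append one linker plus one further ZF domain.
        previous = dbp_formula(n_domains - 1)[len("NH2-"):-len("-COOH")]
        body = f"{previous}-{{linker}}-ZiF{n_domains}"
    return f"NH2-{body}-COOH"


print(dbp_formula(fingers_needed(9)))
```

Each recursive step adds exactly one {linker} and one domain, so a target of 3n bases always yields n domains joined by n - 1 linkers.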
Figure 3 provides a schematic representation of a ZF protein wherein n=3, i.e., one 
which has 3 ZF domains connected by linker sequences and is designed to bind to a 
target DNA string of 9 (3n) bases. 

The above rules enable ready determination of the optimal amino acid(s) for binding to 
any given DNA triplet and thus the identification and positioning of the 3 amino acids in a ZF 
domain which would be the ideal component of a DBP for binding to the DNA triplet. 

The application of the rules can then be extended to design of a DBP containing a set 
number, nd, of ZF domains, which DBP binds to a target stretch of 3nd nucleotides within a 
given DNA sequence. The target 3nd stretch of nucleotides, and the collection and order of nd 
domains in the DBP, are such that the binding energy for the DBP and target DNA sequence 
is the highest possible for any pairing of a DBP containing the set number, nd, of ZF domains 
with any stretch of 3nd nucleotides within the entire DNA molecule being screened. 

Accordingly, the embodiment of the invention of primary importance is a method for 
designing such a DBP for a DNA sequence of any length. The method employs the rules 
disclosed above in combination with a means of screening and ranking all possible segments of 
3nd nucleotides within the sequence by their affinities for DBP's containing nd ZF domains to 
determine a unique DBP with the desired properties. 

More particularly, the invention is directed to a method for designing a DBP, with 
multiple ZF domains connected by linker sequences, that binds selectively to a target DNA 
sequence within a given gene, each of said ZF domains having the formula 

A1XCX2-4CA2A3XFXZ3XXZ2LXZ1HX3-5H 

and each of said linkers having the formula 
A4A5X0-2EA6P, 

wherein 

(i) X is any amino acid; (ii) X2-4 is a peptide from 2 to 4 amino acids in length; (iii) X3-5 is a 
peptide from 3 to 5 amino acids in length; (iv) X0-2 is a peptide from 0 to 2 amino acids in 
length; (iv) A1 is selected from the group consisting of phenylalanine and tyrosine; (v) A2 is 
selected from the group consisting of glycine and aspartic acid; (vi) A3 is selected from the 
group consisting of lysine and arginine; (vii) A4 is selected from the group consisting of 
threonine and serine; (viii) A5 is selected from the group consisting of glycine and glutamic 
acid; (ix) A6 is selected from the group consisting of lysine and arginine; (x) C is cysteine; 
(xi) F is phenylalanine; (xii) L is leucine; (xiii) H is histidine; (xiv) E is glutamic acid; (xv) 
P is proline; and (xvi) Z1, Z2 and Z3 are the base-contacting amino acids, comprising the 
steps of: 

(a) setting a genome to be screened; 

(b) selecting the target DNA sequence in the genome for binding; 

(c) setting the number of zinc-finger domains to nd; 

(d) dividing the target DNA sequence into nucleotide blocks wherein each block 
contains nz nucleotides using a first routine where nz is determined using the 
following relationship: 

nz = 3nd; 

(e) assigning base-contacting amino acids at Z1, Z2 and Z3 to each ZF domain, 
according to the A Rules and/or B Rules set forth in Tables 1-3, of a DBP which binds to the 
first nucleotide block from step (d) as numbered from the first 5' nucleotide of the target gene 
sequence to generate a block-specific DBP and calculating the binding energy, Binding Energy 
block, of each ZF domain of each such block-specific DBP as the product of the binding 
energies, Binding Energy domain, of all zinc-finger domains of the polypeptide, each determined 
using the formula: 

Binding Energy domain = (5 x the number of hydrogen bonds) + (2 x the 
number of H2O contacts) + (the number of hydrophobic contacts); 

(f) subdividing the DBP from step (d) into blocks using a second routine to generate a 
subdivided DBP having three ZF domains; 

(g) screening the subdivided DBP from step (f) against the genome using a third 
routine to determine the number of binding sites in the genome for each subdivided 
DBP and assigning a binding energy for each such site using the 
following formula: 

Binding Energy site = (5 x the number of hydrogen bonds) + (2 x 
the number of H2O contacts) + (the number of hydrophobic contacts); 

(h) calculating a ratio of binding energy, Rb, using a fourth routine for each nucleotide 
block-specific DBP from step (e) using the following formula: 

Rb = Binding Energy block / the sum of all Binding Energy site's for all subdivided 
DBP's from step (g); 

(i) repeating steps (f) through (h) for each subdivided DBP wherein nd > 4; 
(j) repeating steps (d) through (i) for each nucleotide block in the target DNA 
sequence containing nz nucleotides; 
(k) rank-ordering Rb numerical values obtained from step (h); and 
(l) selecting a DBP with an acceptable Rb value. 
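
The energy calculations of steps (e), (g) and (h) can be sketched as follows, under stated assumptions. The function names are introduced here for illustration, the contact counts in the example are made-up inputs rather than values taken from the tables, and returning infinity when no competing site is found is an assumption, since the text does not address that case.

```python
from math import prod


def domain_energy(h_bonds: int, water_contacts: int, hydrophobic_contacts: int) -> int:
    """Binding Energy domain (also used per site): 5*H-bonds + 2*H2O + hydrophobic."""
    return 5 * h_bonds + 2 * water_contacts + hydrophobic_contacts


def block_binding_energy(domain_contacts) -> int:
    """Binding Energy block: product of the domain energies of all ZF domains."""
    return prod(domain_energy(*contacts) for contacts in domain_contacts)


def binding_ratio(block_energy: float, site_energies) -> float:
    """Rb: block energy divided by the summed energies of all competing sites."""
    background = sum(site_energies)
    if background == 0:
        return float("inf")  # assumption: no competing sites were found
    return block_energy / background


# Hypothetical three-finger block and two hypothetical competing genome sites.
block = block_binding_energy([(6, 0, 0), (5, 1, 0), (4, 0, 1)])
rb = binding_ratio(block, [domain_energy(6, 0, 0), domain_energy(5, 1, 0)])
print(block, round(rb, 3))
```

Because the numerator is a product over domains while the denominator is a sum over sites, adding a finger multiplies the numerator but only adds to the denominator.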

Preferred embodiments of this aspect of the invention are: 

1) the design method as set forth above wherein the DBP Rb numerical value is the 
highest numerical value for all DBP's in step (h) that bind to the target DNA sequence. 

2) the method above wherein the DBP Rb numerical value determined in step (h) is at 
least 10,000. 

3) the method above wherein the number of ZF domains, nd, is nine. 

4) the method above wherein the rules for assigning base-contacting amino acids at 
Z1, Z2 and Z3 for each nucleotide block in step (e) are selected from rule set A. 

The invention is further directed to a computer system for designing a DBP, with 
multiple ZF domains connected by linker sequences, that binds selectively to a target DNA 
sequence within a given gene, each of said ZF domains having the formula 

A1XCX2-4CA2A3XFXZ3XXZ2LXZ1HX3-5H 

and each of said linkers having the formula 
A4A5X0-2EA6P, 

wherein 

(i) X is any amino acid; (ii) X2-4 is a peptide from 2 to 4 amino acids in length; (iii) X3-5 is a 
peptide from 3 to 5 amino acids in length; (iv) X0-2 is a peptide from 0 to 2 amino acids in 
length; (iv) A1 is selected from the group consisting of phenylalanine and tyrosine; (v) A2 is 
selected from the group consisting of glycine and aspartic acid; (vi) A3 is selected from the 
group consisting of lysine and arginine; (vii) A4 is selected from the group consisting of 
threonine and serine; (viii) A5 is selected from the group consisting of glycine and glutamic 
acid; (ix) A6 is selected from the group consisting of lysine and arginine; (x) C is cysteine; 
(xi) F is phenylalanine; (xii) L is leucine; (xiii) H is histidine; (xiv) E is glutamic acid; (xv) P 
is proline; and (xvi) Z1, Z2 and Z3 are the base-contacting amino acids, comprising the steps 
of: 

(a) setting a genome to be screened; 

(b) selecting the target DNA sequence in the genome for binding; 

(c) setting the number of ZF domains to nd; 

(d) dividing the target DNA sequence into nucleotide blocks wherein each block 
contains nz nucleotides using a first routine where nz is determined using the 
following relationship: 

nz = 3nd; 

(e) assigning base-contacting amino acids at Z1, Z2 and Z3 to each ZF domain, 
according to the A Rules and/or B Rules set forth in Tables 1-3, of a DBP which binds to the 
first nucleotide block from step (d) as numbered from the first 5' nucleotide of the target gene 
sequence to generate a block-specific DBP and calculating the binding energy, Binding Energy 
block, of each ZF domain of each such block-specific DBP as the product of the binding 
energies, Binding Energy domain, of all domains of the DBP, each determined using the formula: 

Binding Energy domain = (5 x the number of hydrogen bonds) + (2 x the 
number of H2O contacts) + (the number of hydrophobic contacts); 

(f) subdividing the DBP from step (d) into blocks using a second routine to generate a 
subdivided DBP having three ZF domains; 

(g) screening the subdivided DBP from step (f) against the genome using a third 
routine to determine the number of binding sites in the genome for each subdivided 
DBP and assigning a binding energy for each such site using the 
following formula: 

Binding Energy site = (5 x the number of hydrogen bonds) + (2 x 
the number of H2O contacts) + (the number of hydrophobic contacts); 

(h) calculating a ratio of binding energy, Rb, using a fourth routine for each nucleotide 
block-specific DBP from step (e) using the following formula: 

Rb = Binding Energy block / the sum of all Binding Energy site's for all subdivided 
DBP's from step (g); 

(i) repeating steps (f) through (h) for each subdivided DBP wherein nd > 4; 
(j) repeating steps (d) through (i) for each nucleotide block in the target DNA 
sequence containing nz nucleotides; 
(k) rank-ordering Rb numerical values obtained from step (h); 
(l) selecting a DBP with an acceptable Rb value. 

According to the instant invention, Rb, as defined in (h) above for both the design 
method and computer system, has a lower limit of 10,000. Preferably Rb is greater than 1 0* 
Preferred embodiments of this aspect of the invention are: 

1) the computer system as set forth above wherein the DBP Rb numerical value is the 
highest numerical value for all DBP's in step (h) that bind to the target DNA sequence. 

2) the computer system above wherein the DBP Rb numerical value determined in step 
(h) is at least 10,000. 

3) the computer system above wherein the number of ZF domains, nd, is nine. 

4) the computer system above wherein the rules for assigning base-contacting amino 
acids at Z1, Z2 and Z3 for each nucleotide block in step (e) are selected from rule set A. 

The method and computer system of the instant invention are further illustrated by the 
block flow diagrams of Figures 4-9. 

Figure 4 shows the components of the computer system on which the DBP design 
process is implemented. A Central Processor Digital Computer (1) of any manufacture is 
provided with a Computer Program (2) written by the inventors. This Computer Program (2) 
reads a series of files described as DNA-Triple Energy Rules (6), Genome Descriptors (9), 
Genomic DNA Sequence (10) and Gene Features (5). The Central Processor (1) transforms 
this information into the DBP Blocking Fragment Files (7) and the Optimal DBP Designs for 
Genome (8). 

Figure 5 shows that the Computer Program (2) in Figure 4 has two portions. The 
genomic data is first transformed by the Process Genome into Blocking Fragment Files 
function (2). These files are then used by the Design DBP's for a Genome function (3). 

The Process Genome into Blocking Fragment Files block (2) of Figure 5 is represented 
in greater detail in Figure 6. For every nd from 11 down to 3, the Genome Descriptors file (12) 
and the Genome DNA Sequence file (32) are read and transformed into the Unsorted 
Fragment File (7). This same Unsorted Fragment File (14) is transformed by the Sort function 
(13) provided by the computer manufacturer into the Sorted Fragment file (15). The same 
Sorted Fragment File (30) is read and transformed eventually into the DBP-Size Blocking File 
(22). 

The Design DBP's for a Genome block (3) of Figure 5 is represented in greater detail 
in Figure 7. The Genome Descriptors file (3), the Gene Features file (7), the Genome DNA 
Sequence file (9) and the DBP-Size Blocking Files (37) corresponding to the nd's from 11 
down to 3 are read and used to transform the genomic DNA first into genes and then into a 
file of the Optimal DBP Designs for a Genome (38). The transformation and design process is 
done for all the genes in a genome. 

The "Determine if Current-Sub-Window is in Current-Blocking-File" block (22) in 
Figure 7 is expanded in greater detail in Figure 8. 

The "Calculate Binding-Energy-of-Blocking-Fragment" block (24) in Figure 7 is 
expanded in greater detail in Figure 9. 

By applying the algorithm to a variety of DBP's of varying nd, it was experimentally 
determined that a value for nd of 9 is the best starting point in the algorithm, i.e., the process 
should begin with the search for 9-finger DBP's. This can be better understood in terms of the 
selection criterion, Rb, used in evaluating various DBP's. In short DBP's, e.g., ones wherein 
nd = 4 or 5, Binding Energy block, which increases geometrically as the product of all Binding 
Energy domain's, is significantly lower, and Binding Energy site values are relatively large. 
However, as nd increases, the numerator of Rb increases dramatically, while, it has been 
observed, the denominator, representing "background" or "noise," does not significantly 
change. Thus, the case of nd = 9 provides assurance of high affinity and specificity of binding 
without also bringing on the possibility of undue computational needs. 
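
The geometric growth of the numerator can be illustrated with a short sketch. The numbers are hypothetical: every domain is assumed to contribute the same energy of 20, and the background sum is held fixed at 500 to mirror the observation that the denominator does not change significantly. The function name rb_by_length is likewise an assumption.

```python
def rb_by_length(domain_energy: int, background: float, lengths=range(4, 10)) -> dict:
    """Map each nd to (domain_energy ** nd) / background, with background held fixed."""
    return {nd: domain_energy ** nd / background for nd in lengths}


# Hypothetical inputs: identical domains of energy 20, constant background of 500.
ratios = rb_by_length(domain_energy=20, background=500.0)
for nd, rb in ratios.items():
    print(nd, rb)
```

Under these made-up inputs each additional finger multiplies the ratio by the per-domain energy, so the 4-finger and 5-finger designs fall below the stated lower limit of 10,000 while the longer designs exceed it.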

However, it should also be emphasized that the present invention is not limited to the 
design of DBP's wherein nd ≤ 9. For that matter, it will also be appreciated that, while nd = 9 
has been found to be the best starting point, the best DBP for a given situation may turn out to 
be one wherein nd < 9, the length of the target DNA sequence notwithstanding. The concept 
of the invention can be applied to the design of DBP's of any length as required. 

In any event, for a given DNA sequence of N nucleotides, there are N - 26 overlapping 
27-base frames, each a candidate 9-finger DBP site. Each of these can be ordered in terms of 
strength of binding by evaluating the energy function for each 3-nucleotide segment as set 
forth in part (e) of the design method disclosed above. 

In initial computational experiments, a selectable sequence could have no 8-, 7-, 6-, 5- 
or 4-finger subsites; however, with the present system, only the sum of the subsite binding 
energies need be minimized. As a result, it does not matter whether the subsite binding energy 
comes from 3-finger subsites, 4-finger subsites or even (in principle) larger subsites. This 
simple change from logical exclusion to energetic exclusion has been mandated not so much 
by examination of the yeast genome, but more by examination of the worm genome. 
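
The difference between the two exclusion schemes can be illustrated with a short Python sketch. It is not the software of the invention: the frame names and energy values are invented, and each list is assumed to hold the binding energies of every spurious subsite match found in the genome for that frame.

```python
from typing import Dict, List


def logically_excluded(subsite_energies: List[float]) -> bool:
    """Earlier scheme: reject a frame if it has any subsite match at all."""
    return len(subsite_energies) > 0


def pick_by_energy(candidates: Dict[str, List[float]]) -> str:
    """Present scheme: keep the frame whose summed subsite energy is lowest."""
    return min(candidates, key=lambda name: sum(candidates[name]))


# Hypothetical frames with invented subsite binding energies.
candidates = {
    "frame_a": [1700.0, 2618.0],
    "frame_b": [935.0],
    "frame_c": [4000.0, 10.0, 10.0],
}
```

Under logical exclusion all three hypothetical frames would be rejected, whereas energetic exclusion still yields a usable choice.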

The central portion of the instant algorithm is, in the case of finding an acceptable nd- 
finger site (e.g., a 27-base segment for a 9-finger DBP), the search against all other nd-finger 
sites in the entire genome to see if there are any similar sites. If such turns out to be the case, 
the DBP with the highest Rb value is selected. Furthermore, the algorithm checks to see if 
there are any equivalent 8-finger, 7-finger, 6-finger, 5-finger and 4-finger subsites in the whole 
genome for a given 9-finger site. In the event no acceptable 9-finger site is found, the 
algorithm then searches for a suitable 8-finger site. If necessary, the search is continued for a 
7-finger site and so on, until an acceptable DBP binding site is found. 
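
The fallback from 9-finger to shorter sites can be sketched in Python as follows. The callables `score` and `acceptable`, the lower limit `min_fingers`, and the toy test values are assumptions made for illustration; in the described system the score would be the ratio Rb obtained from the blocking files.

```python
from typing import Callable, Optional, Tuple


def find_site(
    target: str,
    score: Callable[[str], float],
    acceptable: Callable[[float], bool],
    start_fingers: int = 9,
    min_fingers: int = 4,  # assumed lower limit, not taken from the text
) -> Optional[Tuple[int, int, float]]:
    """Return (fingers, 0-based start, score) of the best acceptable frame.

    The widest frame size is tried first; a narrower size is tried only
    when no frame of the current size is acceptable.
    """
    for fingers in range(start_fingers, min_fingers - 1, -1):
        width = 3 * fingers  # each ZF domain covers three bases
        scored = [
            (score(target[i:i + width]), i)
            for i in range(len(target) - width + 1)
        ]
        good = [(s, i) for s, i in scored if acceptable(s)]
        if good:
            # max() keeps the first frame encountered among equal scores
            best_score, start = max(good, key=lambda pair: pair[0])
            return fingers, start, best_score
    return None
```

With a toy score such as the fraction of G bases in a frame, a target containing a run of twelve G's flanked by A's has no acceptable 27-base frame at a threshold of 0.5, so the sketch falls back to a 24-base (8-finger) frame.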

Within the search for a 9-finger DBP, the algorithm looks at all 27-base sequences, 
which are called "frames." Each frame is evaluated to determine its interaction with DNA and 
the interaction of all other subframes down to 3-finger subsites. The number of instances of 
each frame and subframe in the genome has been recorded during the genome processing 
phase of the execution of the software. The sequence of the frame or subframe is evaluated as 
a product of the binding energy of each ZF. Each ZF domain recognizes three DNA bases. 
The underlying DNA sequence that a ZF recognizes determines how many hydrogen bonds, 
water contacts and hydrophobic contacts exist between the ZF and the DNA. 
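
A minimal Python sketch of this evaluation is given below. The weights 5, 2 and 1 are those of the binding-energy formula recited in the claims; the contact counts in the example are invented, since in practice they would follow from the rule tables, which are not reproduced here.

```python
from math import prod
from typing import NamedTuple, Sequence


class Contacts(NamedTuple):
    """Contacts between one ZF domain and its three-base DNA triplet."""
    hydrogen_bonds: int
    water_contacts: int
    hydrophobic_contacts: int


def domain_energy(c: Contacts) -> int:
    """Binding energy of a single ZF domain."""
    return 5 * c.hydrogen_bonds + 2 * c.water_contacts + c.hydrophobic_contacts


def frame_energy(domains: Sequence[Contacts]) -> int:
    """Binding energy of a frame or subframe: product over its ZF domains."""
    return prod(domain_energy(c) for c in domains)


# Hypothetical 3-finger subframe with invented contact counts.
example = [Contacts(3, 1, 0), Contacts(2, 0, 1), Contacts(2, 2, 0)]
print(frame_energy(example))  # 17 * 11 * 14 = 2618
```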

The way the algorithm detects whether a given nz-base site occurs in other places in 
the genome is by looking in a B-tree for the site. The whole genome is processed for each of 
the nd-finger sites. The algorithm contains means for sorting and merging the myriad 
fragments and, in the end, there is produced an ordered list of all the blocking fragments for all 
the different finger sizes. 
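
The bookkeeping described above can be approximated in a few lines of Python. Note the substitution: this sketch uses an in-memory hash map (`collections.Counter`) in place of the B-tree and the sort-and-merge files, and it scans one strand only; both are simplifying assumptions, as are the function names.

```python
from collections import Counter
from typing import Dict, Iterable


def count_fragments(genome: str, fingers: int) -> Counter:
    """Count every overlapping fragment of 3 * fingers bases in the genome."""
    width = 3 * fingers
    return Counter(
        genome[i:i + width] for i in range(len(genome) - width + 1)
    )


def blocking_index(
    genome: str, finger_sizes: Iterable[int] = range(3, 12)
) -> Dict[int, Counter]:
    """One occurrence table per finger size (here 3 through 11)."""
    return {n: count_fragments(genome, n) for n in finger_sizes}


def occurrences(index: Dict[int, Counter], site: str) -> int:
    """How often a site of 3 * n bases occurs in the indexed genome."""
    return index[len(site) // 3][site]
```

For genomes too large to hold in memory, an on-disk ordered structure such as the B-tree named in the text would be the natural choice; the lookup semantics are the same.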



Example 1 

The following is given as an example of how the search for, and design of, a DBP is 
typically carried out. It involves screening for 9-finger DBP's (i.e., nd = 9) to bind to a target 
DNA sequence of 100 nucleotides (i.e., N = 100). The sequence is screened, beginning with 
position 1, for every 27-nucleotide sequence, i.e., 1-27, 2-28, 3-29 etc., in the entire 100- 
nucleotide sequence. Once this has been done, the 9-fingers are broken down into 3-finger 
sections, i.e., 1-3, 2-4, 3-5 etc. The algorithm scans and looks for relative strengths of 
binding. The idea is to maximize the ratio of DBP binding to subsite binding, Rb, thus 
eliminating those 9-mers interacting with the greatest numbers of subsites. 
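
The enumeration in this example can be reproduced with the short Python sketch below; the function names are illustrative. Counting overlapping windows directly gives N - 27 + 1 start positions, so a 100-nucleotide target yields 74 frames, and a 9-finger frame contains seven overlapping 3-finger sections.

```python
from typing import List, Tuple


def frames(n_nucleotides: int, fingers: int = 9) -> List[Tuple[int, int]]:
    """1-based (first, last) nucleotide positions of every frame."""
    width = 3 * fingers
    return [(i, i + width - 1) for i in range(1, n_nucleotides - width + 2)]


def three_finger_sections(fingers: int = 9) -> List[Tuple[int, int]]:
    """1-based (first, last) finger numbers of every 3-finger section."""
    return [(f, f + 2) for f in range(1, fingers - 1)]


all_frames = frames(100)
sections = three_finger_sections(9)
```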

The algorithm of the present invention was applied to the genomes of S. cerevisiae and 
C. elegans as illustrated by the following examples: 

Example 2 

The algorithm has been applied to the screening of the yeast genome. Two 
chromosomes of yeast, containing 110 and 447 genes, respectively, have been processed. For 
each gene the algorithm selected the nd-finger sequence with the lowest sum of subsite binding 
energies. In yeast the number of 3-finger blocking fragments is almost maximal (i.e., 4', 
versus 4^9 maximal). In the worm genome (see Example 3), the 3-finger blocking sequences 
are absolutely maximal. In yeast the 4-finger blocking sequences are large in number but the 
population of 5-finger blocking sequences is relatively small. In worm the 4-finger blocking 
sequences are larger in number than the 5-finger blocking sequences, but the latter are larger 
in number relative to yeast. In going from worm to human in the future, one can expect that 
the 4-finger blocking sequences might come close to saturation (i.e., close to 4^12). 

The algorithmic analysis was performed for 2 of the 16 chromosomes of yeast. The 
557 genes in the first two chromosomes seem to present a realistic picture of properties of all 
the chromosomes in the yeast genome. Sample calculations have been run on the whole yeast 
genome but these results are not different from those produced by calculating the properties of 
just two chromosomes' worth of genes. The results of the analysis of 100 yeast genes, typical 
of the findings throughout the analysis of the yeast genome, are presented in Table 4. 

The power of the algorithm is further demonstrated in the results displayed in Figures 
10-14. The figures display results obtained for all 557 genes of the two yeast chromosomes 
on which the studies were focused. 

The strength of each acceptable 9-finger DBP can be calculated. Figure 10 shows that 
the strengths of binding of all the acceptable 9-finger DBP's are uniformly distributed. If this 
curve were bowed down, then the stronger frames would be more preferred. If this curve 
were bowed up, then the weaker frames would be preferred. 

Figure 11 shows that the binding energies (Binding Energy block's) of the acceptable 9- 
finger DBP's are uniformly distributed between 10" and 10^^ binding units. 

Figure 12 shows that the distribution of the sum of the spurious subsite binding 
energies (Binding Energy site's) is itself uniform in the range of 10* to 10* binding units. 

Figure 13 is a nonlogarithmic version of Figure 12. It shows that most of the 
acceptable 9-finger DBP's have spurious subsite binding energies of less than 5x10^. 

Figure 14, produced by taking the ratios of the Figure 11 values to those of Figure 12, 
is a graph of the Rb's for the 9-finger DBP's. This chart shows that the ratio of the DBP 
binding strength of the acceptable 9-finger DBP's to the sum of the binding energies of the 
spurious subsite interactions varies from 10'* to 10*. 

The analytical tools of the present invention were also employed in the further analysis 
of a single yeast gene, YAR073, in particular the 300-bp region of the promoter immediately 
upstream of the coding region. The full sums of the subsite binding energies (SBE's) for each 
27-base frame in this portion of the gene were determined; the results are depicted graphically 
in Figure 15. The primary binding energies (BE's) were also determined, and a correlation 
was found between the SBE values and the values of the ratios of BE:SBE (Rb). Still further 
(Figure 16), it was seen that the peaks of the plot of the Rb values correspond to the footprints 
of the transcription factors of the same gene (determined in a separate study). 

Example 3 

Application of the algorithm according to the instant invention to 100 genes in C. 
elegans showed that the system can be applied as successfully to C. elegans as to S. 
cerevisiae. The results of analysis of the 100 C. elegans genes are presented in Table 5. 

In Figure 17, it can be seen that, for one of the analyzed C. elegans genes, only a 5- 
finger DBP could be designed. For another gene, only a 7-finger DBP could be designed. 
These two genes, 2 and 32, are not seen in Table 5, since it presents results of the analysis 
only for those genes (98 out of 100) for which a 9-mer could be designed. In any event, the 
results depicted in Figure 17 are in keeping with the expectation for analysis of the entire C. 
elegans genome, namely, that the distribution of 5- through 9-finger DBP's is somewhat 
different from that in S. cerevisiae. 






Figure 18 represents the same analysis for the C. elegans genes as was depicted in 
Figure 14 for S. cerevisiae genes. Figure 18 shows a similar Rb value distribution to that seen 
in Figure 14. 

Examples 2 and 3 demonstrate the applicability of the instant invention to the design of 
DBP's for the genomes of two widely disparate organisms. The various results of the 
application of the algorithm to the yeast genome, in particular, and also to the worm genome, 
show the power of the algorithmic tool and demonstrate its foundation in reality, i.e., that it 
does not merely provide a random and/or theoretical analysis. It is to be expected, on the 
basis of these analyses, that the inventive algorithm can be extended to the design of DBP's 
for any desired segment of the genome of any organism of interest, including that of a human. 

Although the instant algorithm involves a search against the entire genome of an 
organism, the results of the present studies strongly indicate that lack of complete knowledge 
of the genome of a given organism would not constitute an impediment to application of the 
present invention to the design of DBP's for that organism. One would expect to be able to 
use the knowledge of block sequences obtained in the studies presented herein on S. cerevisiae 
(a unicellular organism) and C. elegans (a multicellular organism) to form valid estimates of 
allowable sequences for the systems of higher eukaryotes. 

For example, the present studies on yeast and worm indicate that the genomic "noise," 
in this context the spurious binding site energies, is relatively constant, and this can be 
projected to higher, more complex organisms as well. In other words, one would expect from 
the demonstrated combinatorics of DNA sequences to be able to extrapolate, or extend, the 
present algorithm to the analysis of more complex genomes, however much is known of the 
specific sequences therein, with the object of designing effective DBP's. Furthermore, as the 
entire genomes of larger organisms, e.g., D. melanogaster, become known, they will provide 
further keys to the analysis of the genomes of higher organisms, including humans. 

A DBP as specified above may be built by using standard protein synthesis techniques; 
or, employing the standard genetic code, may be used as the basis for specifying and 
constructing a gene whose expression is the DBF. 

Proteins so designed can be used in any application requiring accurate and tight 
binding to a DNA target sequence. For example, a DBP, according to the instant invention, 
can be coupled with a DNA endonuclease activity. When the resultant molecule binds to the 
target DNA, said DNA can be cut at a fixed displacement from the DBP binding site. 

Similarly, in instances in which the target DNA sequence is a promoter, one can 
produce a promoter-specific DBP which, when bound, will act to alter (i.e., enhance, attenuate 
or even terminate) expression of a given gene or, alternatively, genes under control of that 
promoter. 

As another application, a DBP could be designed to bind specific DNA sequences 
when attached to solid supports. Such solid supports could include styrene beads, acrylamide 
well-plates or glass substrates. 

In order to realize the specific applications mentioned above, as well as the full scope 
of applications possible through the instant invention, the DBP can be designed as set forth 
above to include the added feature of a pre- and/or postdomain amino acid sequence of 
arbitrary length. This would include, for example, the coupling of the basic DBP to an 
endonuclease or to a reporter or to a sequence by which the DBP could be coupled to a solid 
support. 

Accordingly, the instant invention includes DBP's that bind to a predetermined target 
double-stranded DNA sequence of 3n (where n>1) base pairs in length of the form: 

NH2 - X0-m - ZiF1 - [{linker} - ZiF2] ... - [{linker} - ZiFn] - X0-p - COOH 

wherein each ZiF1 to ZiFn is a ZF domain of the form set forth above; {linker} is an amino acid 
sequence as set forth above; X0-m stands for a sequence of from 0 to m amino acids and X0-p 
stands for a sequence of from 0 to p amino acids. The values for m and p and the identities of 
the amino acids are determined by the particular protein(s) or amino acid sequence(s) to be 
coupled to the DBP for a given application. 
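
As an illustration only, the ZF domain and linker formulas recited in the claims can be expressed as regular expressions in one-letter amino acid code. The sketch below is one possible reading of those formulas; the pattern names and the example domain sequence are invented, and "." is allowed to match any character rather than only valid amino acid letters.

```python
import re

# ZF domain: A1 X C X2-4 C A2 A3 X F X Z3 X X Z2 L X Z1 H X3-5 H
ZF_DOMAIN = re.compile(
    r"[FY]"          # A1: phenylalanine or tyrosine
    r".C.{2,4}C"     # X, Cys, X2-4, Cys
    r"[GD][KR]"      # A2: Gly or Asp; A3: Lys or Arg
    r".F."           # X, Phe, X
    r"(?P<Z3>.)"     # base-contacting residue Z3
    r".."            # X X
    r"(?P<Z2>.)L."   # base-contacting residue Z2, Leu, X
    r"(?P<Z1>.)"     # base-contacting residue Z1
    r"H.{3,5}H"      # His, X3-5, His
)

# Linker: A4 A5 X0-2 E A6 P
LINKER = re.compile(r"[TS][GE].{0,2}E[KR]P")

# Invented sequence that satisfies the domain formula (not a natural finger).
example_domain = "FQCPECGKSFSRSDELTRHQRTH"
match = ZF_DOMAIN.fullmatch(example_domain)
contacts = match.groupdict() if match else {}

# Two invented domains joined by one linker, without flanking sequences.
two_finger = example_domain + "TGEKP" + example_domain
finger_count = len(ZF_DOMAIN.findall(two_finger))
```

The named groups expose the three base-contacting positions, so the same pattern can be used to read Z1, Z2 and Z3 back out of a designed sequence.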

In a further embodiment of the invention, the Zn2+ atom, which forms a complex with 
the two cysteine and two histidine amino acids in a specific ZF motif, can be substituted by a 
Co2+ or a Cd2+ atom, thus making a "cobalt finger" or a "cadmium finger." 

The rules presented in Table 2 ("rule set A") are to be regarded as the "first choice" 
rules for optimal combinations in ZF-DNA recognition. However, it should be emphasized, as 
indicated in column B ("rule set B") of Table 1 or Table 3, that there are many alternative AA 
combinations that would also be expected to be important in the design of DNA-binding 
proteins capable of forming useful ZF-DNA complexes. 

What is claimed is: 

1. A method for designing a DBP, with multiple ZF domains connected by linker sequences, 
that binds selectively to a target DNA sequence within a given gene, each of said ZF 
domains having the formula 

A1XCX2-4CA2A3XFXZ3XXZ2LXZ1HX3-5H 
and each of said linkers having the formula 

A4A5X0-2EA6P, 
wherein 

(i) X is any amino acid; (ii) X2-4 is a peptide from 2 to 4 amino acids in length; (iii) X3-5 is a 
peptide from 3 to 5 amino acids in length; (iv) X0-2 is a peptide from 0 to 2 amino acids in 
length; (iv) A1 is selected from the group consisting of phenylalanine and tyrosine; (v) A2 is 
selected from the group consisting of glycine and aspartic acid; (vi) A3 is selected from the 
group consisting of lysine and arginine; (vii) A4 is selected from the group consisting of 
threonine and serine; (viii) A5 is selected from the group consisting of glycine and glutamic 
acid; (ix) A6 is selected from the group consisting of lysine and arginine; (x) C is cysteine; 
(xi) F is phenylalanine; (xii) L is leucine; (xiii) H is histidine; (xiv) E is glutamic acid; (xv) 
P is proline; and (xvi) Z1, Z2 and Z3 are the base-contacting amino acids, which method 
comprises an algorithm comprising the steps of: 

(a) setting a genome to be screened; 

(b) selecting the target DNA sequence in the genome for binding; 

(c) setting the number of ZF domains to nd; 

(d) dividing the target DNA sequence into nucleotide blocks wherein each block 
contains nz nucleotides using a first routine where nz is determined using the 
following relationship: 

nz = 3nd; 

(e) assigning base-contacting amino acids at Z1, Z2 and Z3 to each ZF domain, 
according to the A Rules and/or B Rules set forth in Tables 1-3 of the specification, of a DBP 
which binds to the first nucleotide block from step (d) as numbered from the first 5' nucleotide 
of the target gene sequence to generate a block-specific DBP and calculating the binding 
energy, Binding Energy block, of each ZF domain of each such block-specific DBP as the 
product of the binding energies, Binding Energy domain, of all ZF domains of the DBP, each 
determined using the formula: 

Binding Energy domain = (5 x the number of hydrogen bonds) + (2 x the 
number of H2O contacts) + (the number of hydrophobic contacts); 

(f) subdividing the DBP from step (d) into blocks using a second routine to generate a 
subdivided DBP having three ZF domains; 

(g) screening the subdivided DBP from step (f) against the genome using a third 
routine to determine the number of binding sites in the genome for each subdivided 
DBP in the genome and assigning a binding energy for each such site using the 
following formula: 

Binding Energy site n = (5 x the number of hydrogen bonds) + (2 x 
the number of H2O contacts) + (the number of hydrophobic contacts); 

(h) calculating a ratio of binding energy, Rb, using a fourth routine for each nucleotide 
block-specific DBP from step (e) using the following formula: 

Rb = Binding Energy block / the sum of all Binding Energy site n's for all 
subdivided DBP's from step (g); 

(i) repeating steps (f) through (h) for each subdivided DBP wherein nd > 4; 

(j) repeating steps (d) through (i) for each nucleotide block in the target DNA 
sequence containing nz nucleotides; 

(k) rank-ordering Rb numerical values obtained from step (h); and 

(l) selecting a DBP with an acceptable Rb value. 

2. The method of claim 1 wherein the DBP selected is that whose Rb numerical value is 
the highest numerical value for all DBP's in step (h) that bind to the target DNA 
sequence. 

3. The method of claim 1 wherein the DBP Rb numerical value determined in step (h) is at 
least 10,000. 

4. The method of claim 1 wherein the number of ZF domains, nd, is nine. 

5. The method of claim 1 wherein the rules for assigning base-contacting amino acids at 
Z1, Z2 and Z3 for each nucleotide block in step (e) are selected from rule set A. 

6. The method of claim 1 wherein the rules for assigning base-contacting amino acids at 
Z1, Z2 and Z3 for each nucleotide block in step (e) are selected from rule set B. 

7. The method of claim 1 wherein the rules for assigning base-contacting amino acids at 
Z1, Z2 and Z3 for each nucleotide block in step (e) are a combination selected from rule 
sets A and B. 



8. A computer system for designing a DBP, with multiple ZF domains connected by 
linker sequences, that binds selectively to a target DNA sequence within a given gene, each of 
said ZF domains having the formula 

A1XCX2-4CA2A3XFXZ3XXZ2LXZ1HX3-5H 

and each of said linkers having the formula 

A4A5X0-2EA6P, 

wherein 

(i) X is any amino acid; (ii) X2-4 is a peptide from 2 to 4 amino acids in length; (iii) X3-5 is a 
peptide from 3 to 5 amino acids in length; (iv) X0-2 is a peptide from 0 to 2 amino acids in 
length; (iv) A1 is selected from the group consisting of phenylalanine and tyrosine; (v) A2 is 
selected from the group consisting of glycine and aspartic acid; (vi) A3 is selected from the 
group consisting of lysine and arginine; (vii) A4 is selected from the group consisting of 
threonine and serine; (viii) A5 is selected from the group consisting of glycine and glutamic 
acid; (ix) A6 is selected from the group consisting of lysine and arginine; (x) C is cysteine; 
(xi) F is phenylalanine; (xii) L is leucine; (xiii) H is histidine; (xiv) E is glutamic acid; (xv) P 
is proline; and (xvi) Z1, Z2 and Z3 are the base-contacting amino acids, which computer 
system comprises means for design which include an algorithm comprising the steps of: 

(a) setting a genome to be screened; 

(b) selecting the target DNA sequence in the genome for binding; 

(c) setting the number of ZF domains to nd; 

(d) dividing the target DNA sequence into nucleotide blocks wherein each block 
contains nz nucleotides using a first routine where nz is determined using the 
following relationship: 

nz = 3nd; 

(e) assigning base-contacting amino acids at Z1, Z2 and Z3 to each ZF domain, 
according to the A Rules and/or B Rules set forth in Tables 1-3 of the specification, of a DBP 
which binds to the first nucleotide block from step (d) as numbered from the first 5' nucleotide 
of the target gene sequence to generate a block-specific DBP and calculating the binding 
energy, Binding Energy block, of each ZF domain of each such block-specific DBP as the 
product of the binding energies, Binding Energy domain, of all ZF domains of the DBP, using the 
formula: 

Binding Energy domain = (5 x the number of hydrogen bonds) + (2 x the 
number of H2O contacts) + (the number of hydrophobic contacts); 

(f) subdividing the DBP from step (d) into blocks using a second routine to generate a 
subdivided DBP having three ZF domains; 

(g) screening the subdivided DBP from step (f) against the genome using a third 
routine to determine the number of binding sites in the genome for each subdivided 
DBP in the genome and assigning a binding energy for each such site using the 
following formula: 

Binding Energy site n = (5 x the number of hydrogen bonds) + (2 x 
the number of H2O contacts) + (the number of hydrophobic contacts); 

(h) calculating a ratio of binding energy, Rb, using a fourth routine for each nucleotide 
block-specific DBP from step (e) using the following formula: 

Rb = Binding Energy block / the sum of all Binding Energy site n's for all subdivided 
DBP's from step (g); 

(i) repeating steps (f) through (h) for each subdivided DBP wherein nd > 4; 

(j) repeating steps (d) through (i) for each nucleotide block in the target DNA 
sequence containing nz nucleotides; 

(k) rank-ordering Rb numerical values obtained from step (h); and 

(l) selecting a DBP with an acceptable Rb value. 

9. The computer system according to claim 8 wherein the DBP selected is that whose Rb 
numerical value is the highest numerical value for all DBP's in step (h) that bind to the target 
DNA sequence. 

10. The computer system according to claim 8 wherein the DBP Rb numerical value 
determined in step (h) is at least 10,000. 

11. The computer system according to claim 8 wherein the number of ZF domains, nd, is 
nine. 

12. The computer system according to claim 8 wherein the rules for assigning base- 
contacting amino acids at Z1, Z2 and Z3 for each nucleotide block in step (e) are selected from 
rule set A. 

13. The computer system according to claim 8 wherein the rules for assigning base- 
contacting amino acids at Z1, Z2 and Z3 for each nucleotide block in step (e) are selected from 
rule set B. 



14. The computer system according to claim 8 wherein the rules for assigning base- 
contacting amino acids at Z1, Z2 and Z3 for each nucleotide block in step (e) are a combination 
selected from rule sets A and B. 

FIGURE 5 

FIGURE 6A 

FIGURE 6D 

FIGURE 7A 

FIGURE 8 

FIGURE 9A 